Abstract


We study the structure and learnability of sums of independent integer random variables
(SIIRVs). For $k \in \mathbb{Z}_+$, a k-SIIRV of order $n \in \mathbb{Z}_+$ is the probability distribution of the sum of
n mutually independent random variables each supported on $\{0, 1, \ldots, k-1\}$. We denote by
$\mathcal{S}_{n,k}$ the set of all k-SIIRVs of order n.

How many samples are required to learn an arbitrary distribution in $\mathcal{S}_{n,k}$? In this paper,
we tightly characterize the sample and computational complexity of this problem. More precisely,
we design a computationally efficient algorithm that uses $\widetilde{O}(k/\epsilon^2)$ samples, and learns
an arbitrary k-SIIRV within error $\epsilon$, in total variation distance. Moreover, we show that the
optimal sample complexity of this learning problem is $\Theta((k/\epsilon^2)\sqrt{\log(1/\epsilon)})$, i.e., we prove an
upper bound and a matching information-theoretic lower bound. Our algorithm proceeds by
learning the Fourier transform of the target k-SIIRV in its effective support. Its correctness
relies on the approximate sparsity of the Fourier transform of k-SIIRVs - a structural property
that we establish, roughly stating that the Fourier transform of k-SIIRVs has small magnitude
outside a small set.

Along the way we prove several new structural results about k-SIIRVs. As one of our main
structural contributions, we give an efficient algorithm to construct a sparse proper $\epsilon$-cover for
$\mathcal{S}_{n,k}$, in total variation distance. We also obtain a novel geometric characterization of the space
of k-SIIRVs. Our characterization allows us to prove a tight lower bound on the size of $\epsilon$-covers
for $\mathcal{S}_{n,k}$ - establishing that our cover upper bound is optimal - and is the key ingredient in our
tight sample complexity lower bound.

Our approach of exploiting the sparsity of the Fourier transform in distribution learning is
general, and has recently found additional applications. In a subsequent work [DKS15a], we use
a generalization of this idea (in higher dimensions) to obtain the first efficient learning algorithm
for Poisson multinomial distributions. In [DKS15b], we build on this approach to obtain the
fastest known proper learning algorithm for Poisson binomial distributions (2-SIIRVs).
1 Introduction

1.1 Motivation and Background

We study sums of independent integer random variables:
Definition. For $k \in \mathbb{Z}_+$, a k-IRV is any random variable supported on $\{0, 1, \ldots, k-1\}$. A k-SIIRV
of order n is any random variable $X = \sum_{i=1}^{n} X_i$ where the $X_i$'s are independent k-IRVs. We will
denote by $\mathcal{S}_{n,k}$ the set of probability distributions of all k-SIIRVs of order n.
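
As a concrete illustration of the definition, the following minimal Python sketch draws samples from a k-SIIRV and computes its exact probability mass function by repeated convolution. The function names, and the choice to represent each k-IRV as a list of k probabilities, are assumptions made for illustration and are not notation from the paper.

```python
import random

def sample_k_irv(pmf, rng):
    """Draw one value from a k-IRV given its pmf over {0, ..., k-1}."""
    return rng.choices(range(len(pmf)), weights=pmf, k=1)[0]

def sample_siirv(pmfs, rng):
    """Draw one sample from the k-SIIRV X = X_1 + ... + X_n."""
    return sum(sample_k_irv(p, rng) for p in pmfs)

def siirv_pmf(pmfs):
    """Exact pmf of the sum on {0, ..., n(k-1)}, by repeated convolution."""
    dist = [1.0]  # pmf of the empty sum (point mass at 0)
    for p in pmfs:
        new = [0.0] * (len(dist) + len(p) - 1)
        for i, a in enumerate(dist):
            for j, b in enumerate(p):
                new[i + j] += a * b
        dist = new
    return dist
```

For instance, `siirv_pmf([[0.5, 0.5], [0.5, 0.5]])` describes a 2-SIIRV of order 2, i.e., a Binomial(2, 1/2) distribution.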

For convenience, throughout this paper, we will often blur the distinction between a random 
variable and its distribution. In particular, we will use the term k-SIIRV for the random variable
or its corresponding distribution, and the distinction will be clear from the context. 

Sums of independent integer random variables (SIIRVs) comprise a rich class of distributions
that arise in many settings. The special case of k = 2, $\mathcal{S}_{n,2}$, was first considered by Poisson [Poi37] as
a non-trivial extension of the Binomial distribution, and is known as the Poisson binomial distribution
(PBD). In application domains, SIIRVs have many uses in research areas such as survey sampling,
case-control studies, and survival analysis; see, e.g., [CL97] for a survey of the many practical uses
of these distributions. We remark that these distributions are of fundamental interest and have
been extensively studied in probability and statistics. For example, tail bounds on SIIRVs form an
important special case of Chernoff/Hoeffding bounds [Che52, Hoe63, DP09b]. Moreover, there is
a long line of research on approximate limit theorems for SIIRVs, dating back several decades (see,
e.g., [Pre83, Kru86, BHJ92], and [CL10, CGS11] for some recent results).


Structure and Learning of k-SIIRVs. The main motivation of this work was the problem of
learning an unknown k-SIIRV given access to independent samples. Understanding this problem
is intimately related to obtaining a refined structural understanding of the space of k-SIIRVs. The
connection between structure and distribution learning is the main thrust of this paper.

Distribution learning or density estimation is the following task [DG85, KMR+94, DL01]: Given
independent samples from an unknown distribution P in a family $\mathcal{D}$, and an error tolerance $\epsilon > 0$,
output a hypothesis H such that with high probability the total variation distance $d_{TV}(H, P)$ is at
most $\epsilon$. The sample and computational complexity of this unsupervised learning problem depends
on the structure of the underlying family $\mathcal{D}$. The main goals here are: (i) to characterize the sample
complexity of the learning problem, i.e., to obtain matching information-theoretic upper and lower
bounds, and (ii) to design a computationally efficient learning algorithm - i.e., an algorithm whose
running time is polynomial in the sample (input) size - that uses an information-theoretically
optimal sample size.

While density estimation has been studied for several decades, the number of samples required
to learn is not yet well understood, even for surprisingly simple and natural classes of univariate
discrete distributions. More specifically, there is no known complexity measure of a distribution
family $\mathcal{D}$ that characterizes the sample complexity of learning an unknown distribution from $\mathcal{D}$.
In contrast, the VC dimension of a concept class plays such a role in the PAC model of learning
Boolean functions (see, e.g., [BEHW89, KV94]).

We remark that the classical information-theoretic quantity of the metric entropy [vdVW96,
DL01, Tsy08], i.e., the logarithm of the size of the smallest $\epsilon$-cover (footnote 1) of the distribution class, provides
an upper bound on the sample complexity of learning. Alas, this upper bound is suboptimal in
general - both quantitatively and qualitatively - and in particular for the class of k-SIIRVs, as we
show in this paper.


1 Formally, a subset $\mathcal{D}_\epsilon \subseteq \mathcal{D}$ in a metric space $(\mathcal{D}, d)$ is said to be an $\epsilon$-cover of $\mathcal{D}$ with respect to the metric
$d : \mathcal{D}^2 \to \mathbb{R}_+$, if for every $x \in \mathcal{D}$ there exists some $y \in \mathcal{D}_\epsilon$ such that $d(x, y) \le \epsilon$. In this paper, we focus on the total
variation distance between distributions.

Obtaining a computationally efficient learning algorithm with optimal (or near-optimal) sample
complexity is an important goal. In many learning settings, achieving this goal turns out to be 
quite challenging. More specifically, in many scenarios, both supervised and unsupervised, the 
only computationally efficient learning algorithms known use a (provably) suboptimal sample size. 
Intuitively, increasing the sample size (e.g., by a polynomial factor) can make the algorithmic task 
substantially easier. Characterizing the tradeoff between sample complexity and computational 
complexity is of fundamental importance in learning theory. In this work, we essentially characterize 
this tradeoff for the unsupervised problem of learning SIIRVs. 

1.2 Our Results

The main technical contribution of this paper is the use of Fourier analytic
and geometric tools to obtain a refined structural understanding of the space of k-SIIRVs. As a
byproduct of our techniques, we characterize the sample complexity of learning k-SIIRVs (up to
constant factors), and moreover we obtain a computationally efficient learning algorithm with
near-optimal sample complexity. Our results answer the main open questions of [DDS12b, DDO+13].

Along the way we prove several new structural results of independent interest about k-SIIRVs,
including: the approximate sparsity of their Fourier transform; tight upper and lower bounds on
$\epsilon$-covers (in total variation distance and Kolmogorov distance); and a novel geometric characterization
of the space of k-SIIRVs, that is crucial for our sample complexity lower bound. Below, we state
our results in detail and elaborate on their context and the connections between them.

Learning SIIRVs via the Fourier Transform. As our first result, we give a sample
near-optimal and computationally efficient learning algorithm for k-SIIRVs:

Theorem 1.1 (Nearly Optimal Learning of k-SIIRVs). There is a learning algorithm for k-SIIRVs
with the following performance guarantee: Let P be any k-SIIRV of order n. The algorithm uses
$\widetilde{O}(k/\epsilon^2)$ samples from P, runs in time $\widetilde{O}(k^3/\epsilon^2)$ (see footnote 2), and with probability at least 2/3 outputs a
(succinct description of a) hypothesis H such that $d_{TV}(H, P) \le \epsilon$.

Our algorithm outputs a succinct description of the hypothesis H, via its Discrete Fourier
Transform (DFT) $\widehat{H}$, which is supported on a set of small cardinality. The DFT immediately gives
a fast evaluation oracle for H. We also show how to use the DFT, in a black-box manner, to obtain
an efficient approximate sampler for the target distribution P. Our efficient learning algorithm is
described in Section 2.1. In Section 2.3, we give the efficient construction of our sampler.

We remark that the sample complexity of our algorithm is optimal up to logarithmic factors.
Indeed, even learning a single k-IRV to variation distance $\epsilon$ requires $\Omega(k/\epsilon^2)$ samples. For the
case of k = 2, [DDS12b] gave a learning algorithm that uses $\widetilde{O}(1/\epsilon^2)$ samples, but runs in
quasi-polynomial time, namely $(1/\epsilon)^{\mathrm{polylog}(1/\epsilon)}$. More recently, [DDO+13] studied the case of general k,
and gave an algorithm that uses $\mathrm{poly}(k/\epsilon)$ samples and time. Notably, the degree of this polynomial
is quite high: the sample complexity of the [DDO+13] algorithm is $\Omega(k^9/\epsilon^6)$. Theorem 1.1 gives a
nearly-tight upper bound on the sample complexity of this learning problem, and does so with a
computationally efficient algorithm.

Given our $\widetilde{O}(k/\epsilon^2)$ sample upper bound, it would be tempting to conjecture that $\Theta(k/\epsilon^2)$ is in
fact the optimal sample complexity of learning k-SIIRVs. If true, this would imply that learning a
k-SIIRV is as easy as learning a k-IRV. Surprisingly, we show that this is not the case:


2 We work in the standard “word RAM” model in which basic arithmetic operations on $O(\log n)$-bit integers are
assumed to take constant time.

Theorem 1.2 (Optimal Sample Complexity). For any $k \in \mathbb{Z}_+$, $\epsilon \le 1/\mathrm{poly}(k)$, there is an algorithm
that learns k-SIIRVs within variation distance $\epsilon$ using $O((k/\epsilon^2)\sqrt{\log(1/\epsilon)})$ samples. Moreover, any
algorithm for this problem information-theoretically requires $\Omega((k/\epsilon^2)\sqrt{\log(1/\epsilon)})$ samples.

Theorem 1.2 precisely characterizes the sample complexity of learning k-SIIRVs (up to constant
factors) by giving an upper bound and a matching information-theoretic sample lower bound.
The sharp sample complexity bound of $\Theta((k/\epsilon^2)\sqrt{\log(1/\epsilon)})$ is surprising, and cannot be obtained
using standard information-theoretic tools (e.g., metric entropy). We elaborate on this issue in
Section 1.4.

We remark that the upper bound of Theorem 1.2 does not specify the running time of the
corresponding algorithm. This is because the simplest such algorithm actually runs in time exponential
in k. For the important special case of k = 2, we obtain a sample-optimal learning algorithm that
runs in sample-linear time:

Theorem 1.3 (Optimal Learning of PBDs (2-SIIRVs)). For any $\epsilon > 0$, there is an algorithm that
learns PBDs within variation distance $\epsilon$ using $O((1/\epsilon^2)\sqrt{\log(1/\epsilon)})$ samples and running in time
$\widetilde{O}((1/\epsilon^2)\sqrt{\log(1/\epsilon)})$.

The upper bounds of Theorem 1.2 and Theorem 1.3 are established in Section 2.4. Our tight
sample complexity lower bound is proved in Section 5.

Using the Fourier Transform for Distribution Learning. Our learning upper bounds are
obtained via an approach which is novel in this context. Specifically, we show that the Fourier
transform of k-SIIRVs is approximately sparse, and exploit this property to learn the distribution
via learning its Fourier transform in its effective support. The sparsity of the Fourier transform
explains why this family of distributions is learnable with sample complexity independent of n, and
moreover it yields the sharp sample-complexity bound. The algorithmic idea of exploiting Fourier
sparsity for distribution learning is general (see Section 2.2), and was subsequently used by the
authors in other related settings [DKS15a, DKS15b].

Structure of k-SIIRVs. Our core structural result is the following simple property of the Fourier
transform of k-SIIRVs:

Any k-SIIRV with “large” variance has a Fourier transform with “small” effective support. 

One can obtain different versions of the above informal statement depending on the setting and
the desired application. See Lemma 2.3 for a formal statement in the context of the DFT. The
Fourier sparsity of k-SIIRVs forms the basis for our upper bounds in this paper. As previously
mentioned, this structural property motivates and enables our learning algorithm. Moreover, it is
useful in order to obtain sparse $\epsilon$-covers for $\mathcal{S}_{n,k}$, the space of k-SIIRVs, under the total variation
distance.

More specifically, using the approximate sparsity of the Fourier transform of SIIRVs combined
with analytic arguments, we obtain a computationally efficient algorithm to construct a proper
$\epsilon$-cover for $\mathcal{S}_{n,k}$ of near-minimum size. In particular, we show:

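Theorem 1.4 (Optimal Covers for k-SIIRVs). For $\epsilon \le 1/k$, there exists a proper $\epsilon$-cover $\mathcal{S}_{n,k,\epsilon} \subseteq
\mathcal{S}_{n,k}$ of $\mathcal{S}_{n,k}$ under the total variation distance of size $|\mathcal{S}_{n,k,\epsilon}| \le n \cdot (1/\epsilon)^{O(k \log(1/\epsilon))}$ that can be
constructed in polynomial time.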

The best previous upper bound on the cover size of 2-SIIRVs is $n^2 + n \cdot (1/\epsilon)^{O(\log^2(1/\epsilon))}$ [DP09a,
DP14]. For k > 2, [DDO+13] gives a non-proper cover of size $n \cdot 2^{\mathrm{poly}(k/\epsilon)}$.

Our proper cover upper bound construction provides a smaller search space for essentially any
optimization problem over k-SIIRVs. Specifically, Theorem 1.4 has the following implication in
computational game theory: Via a connection established in [DP07, DP09a], the proper cover
construction of Theorem 1.4 (for k = 2) yields an improved $\mathrm{poly}(n) \cdot (1/\epsilon)^{O(\log(1/\epsilon))}$ time algorithm
for computing $\epsilon$-Nash equilibria in anonymous games with 2 strategies per player. Our matching
lower bound on the cover size implies that the “cover-based approach” cannot lead to an FPTAS
for this problem. We note that computing an (exact) Nash equilibrium in an anonymous game with
a constant number of strategies was recently shown to be intractable [CDO15]. Our cover upper
bound is proved in Section 3.

We also prove a matching lower bound on the cover size, showing that our above construction 
is essentially optimal: 


Theorem 1.5 (Cover Size Lower Bound for k-SIIRVs). For $\epsilon \le 1/\mathrm{poly}(k)$, and $n = \Omega(\log(1/\epsilon))$,
any $\epsilon$-cover for $\mathcal{S}_{n,k}$ has size at least $n \cdot (1/\epsilon)^{\Omega(k \log(1/\epsilon))}$.

Before our work, no non-trivial lower bound on the cover size was known. We view the inherent
quasi-polynomial dependence on $1/\epsilon$ of the cover size established here as a rather surprising fact.
Our cover size lower bound proof relies on a new geometric characterization of the space of k-SIIRVs
that we believe is of independent interest, and may find other applications. Our tight lower bound
on the sample complexity of learning k-SIIRVs relies critically on this characterization. Our cover
size lower bound is proved in Section 4.


1.3 Preliminaries

We record a few definitions that will be used throughout this paper.

Distributions and Metrics. For $m \in \mathbb{Z}_+$, we denote $[m] \stackrel{\mathrm{def}}{=} \{0, 1, \ldots, m\}$. A function $P : A \to
\mathbb{R}$, over a finite set A, is called a distribution if $P(a) \ge 0$ for all $a \in A$, and $\sum_{a \in A} P(a) = 1$. The
function P is called a pseudo-distribution if $\sum_{a \in A} P(a) = 1$. For a pseudo-distribution P over [m],
$m \in \mathbb{Z}_+$, we write $P(i)$ to denote the value $\Pr_{X \sim P}[X = i]$ of the probability density function (pdf)
at point i, and $P(\le i)$ to denote the value $\Pr_{X \sim P}[X \le i]$ of the cumulative density function (cdf)
at point i. For $S \subseteq [n]$, we write $P(S)$ to denote $\sum_{i \in S} P(i)$.

The total variation distance between two (pseudo-)distributions P and Q supported on a finite
set A is $d_{TV}(P, Q) \stackrel{\mathrm{def}}{=} \max_{S \subseteq A} |P(S) - Q(S)| = (1/2) \cdot \|P - Q\|_1$. Similarly, if X and Y are random
variables, their total variation distance $d_{TV}(X, Y)$ is defined as the total variation distance between
their distributions. Another useful notion of distance between distributions/random variables is the
Kolmogorov distance, defined as $d_K(P, Q) \stackrel{\mathrm{def}}{=} \sup_{x \in \mathbb{R}} |P(\le x) - Q(\le x)|$. Note that for any pair of
distributions P and Q supported on a finite subset of $\mathbb{R}$ we have that $d_K(P, Q) \le d_{TV}(P, Q)$.
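
The two distances can be computed directly from the definitions. The sketch below assumes, for illustration only, that both distributions are given as lists of probabilities over $\{0, 1, \ldots\}$ (shorter lists are padded with zeros); the function names are ours.

```python
from itertools import accumulate

def _pad(p, q):
    """Extend the shorter pmf with zeros so both share one domain."""
    n = max(len(p), len(q))
    return list(p) + [0.0] * (n - len(p)), list(q) + [0.0] * (n - len(q))

def tv_distance(p, q):
    """Total variation distance: half the L1 distance between the pmfs."""
    p, q = _pad(p, q)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kolmogorov_distance(p, q):
    """Kolmogorov distance: largest gap between the two cdfs."""
    p, q = _pad(p, q)
    return max(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))
```

On integer-supported distributions the supremum over real x is attained at an integer, so comparing the cdfs at the support points suffices.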

Distribution Learning. Since we are interested in the computational complexity of distribution
learning, our algorithms will need to use a succinct description of their hypotheses. A simple
succinct representation of a discrete distribution is via an evaluation oracle for the probability
mass function. For $\epsilon > 0$, an $\epsilon$-evaluation oracle for a distribution P over [m] is a polynomial
size circuit C with $O(\log m)$ input bits such that for each input z, the output of the circuit C(z)
equals the binary representation of the probability $P'(z)$, for some pseudo-distribution P' which
has $d_{TV}(P', P) \le \epsilon$. Another general way to succinctly specify a distribution is to give the code
of an efficient algorithm that takes “pure” randomness and transforms it into a sample from the
distribution. This is the standard notion of a sampler. An $\epsilon$-sampler for P is a circuit C with
$O(\log m + \log(1/\epsilon))$ input bits z and $O(\log m)$ output bits y which is such that when z is uniformly random, then
$y \sim P'$, for some distribution P' which has $d_{TV}(P', P) \le \epsilon$.

We emphasize that our learning algorithms output both an e-sampler and an e-evaluation oracle 
for the target distribution. 

Covers. Let $\mathcal{F}$ be a family of probability distributions. Given $\delta > 0$, a subset $\mathcal{Q} \subseteq \mathcal{F}$ is said to be
a proper $\delta$-cover of $\mathcal{F}$ with respect to the metric $d(\cdot, \cdot)$ if for every distribution $P \in \mathcal{F}$ there exists
some $Q \in \mathcal{Q}$ such that $d(P, Q) \le \delta$. If $\mathcal{Q}$ is not a subset of $\mathcal{F}$, then the cover is called non-proper.
The $\delta$-covering number for $(\mathcal{F}, d)$ is the minimum cardinality of a $\delta$-cover. The $\delta$-packing number
for $(\mathcal{F}, d)$ is the maximum number of points (distributions) in $\mathcal{F}$ at pairwise distance at least $\delta$
from each other.

1.4 Our Approach and Techniques

The unifying idea of this work is an analysis of the
structure of the Fourier Transform (FT) of k-SIIRVs. The FT is a natural tool to consider in this
context. Recall that the FT of a sum of independent random variables is the product of the FT’s
of the individual variables. Moreover, if two random variables have similar FT’s, they also have
similar distributions. These two basic facts are the starting point of our analysis. We now provide
an overview of the ideas underlying our results, and give a comparison to previous techniques.
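
The first of these facts is easy to check numerically. The sketch below, with illustrative names and an arbitrary modulus `M`, compares the transform of a sum of two independent 3-IRVs (whose pmf is the convolution of the two pmfs) against the pointwise product of the individual transforms.

```python
import cmath

def fourier(pmf, M):
    """xi -> E[exp(-2*pi*i*xi*X/M)] for xi = 0..M-1, X distributed as pmf."""
    return [sum(p * cmath.exp(-2j * cmath.pi * xi * j / M) for j, p in enumerate(pmf))
            for xi in range(M)]

def convolve(p, q):
    """pmf of X + Y for independent X ~ p and Y ~ q."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

p, q = [0.2, 0.3, 0.5], [0.6, 0.1, 0.3]  # two arbitrary 3-IRVs
M = 7                                     # arbitrary modulus for the check
lhs = fourier(convolve(p, q), M)
rhs = [a * b for a, b in zip(fourier(p, M), fourier(q, M))]
max_gap = max(abs(a - b) for a, b in zip(lhs, rhs))  # zero up to rounding
```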

Discussion & Previous Approaches for Learning SIIRVs. Let $\mathcal{D}$ be a family of distributions
over a domain of size N. How many samples are required to learn an arbitrary $P \in \mathcal{D}$ within
variation distance $\epsilon$? Without any restrictions on $\mathcal{D}$, it is a folklore fact that the sample complexity
of learning is $\Theta(N/\epsilon^2)$. The optimal learning algorithm in this case is the obvious one: output the
empirical distribution. By exploiting the structure of the family $\mathcal{D}$, one may obtain better results.
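
The folklore learner mentioned above is short enough to write out. This is a minimal sketch under the assumption that samples are integers in $\{0, \ldots, N-1\}$; the function name is ours.

```python
from collections import Counter

def empirical_distribution(samples, domain_size):
    """Return the empirical pmf over {0, ..., domain_size - 1}."""
    counts = Counter(samples)
    n = len(samples)
    return [counts.get(j, 0) / n for j in range(domain_size)]
```

The hypothesis has one parameter per domain element, which is why its sample complexity scales with N; the structured learners discussed next avoid this dependence.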

A very natural type of structure to consider is some sort of “shape constraint” on the probability
density function, such as log-concavity or unimodality. There is a long line of work in statistics on
this topic (see, e.g., the books [BBBB72, GJ14]), and more recently in TCS [DDS12a, CDSS14a,
CDSS14b, ADLS15]. Alas, it turns out that k-SIIRVs do not satisfy any of the shape constraints
considered in the literature (see [DDO+13] for a discussion).

A different type of structure, based on the notion of metric entropy [Yat85, Bir86, DL01], yields
the following implication: If a distribution class $\mathcal{D}$ has an $\epsilon/2$-cover of size M, then it is learnable
with $O(\log M/\epsilon^2)$ samples (footnote 3). In a celebrated paper in information theory [YB99], Yang and Barron
show that, for broad families of (continuous) distributions, the metric entropy characterizes the
sample complexity of learning. For k-SIIRVs, however, this is not the case: Via Theorem 1.4, the
metric entropy method implies a sample upper bound of $O((1/\epsilon^2) \cdot \log n + (k/\epsilon^2) \cdot \log^2(1/\epsilon))$. Note
that, since our cover size upper bound is tight, this sample bound is the limit of the metric entropy
method for k-SIIRVs. Thus, this method gives a suboptimal sample upper bound for our learning
problem, both qualitatively (dependence on n), and quantitatively (dependence on $\epsilon$).
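
The arithmetic behind this bound can be spelled out as follows. This is a sketch that absorbs the change from $\epsilon$ to $\epsilon/2$ into the constants hidden by $O(\cdot)$.

```latex
\begin{align*}
M &\le n \cdot (1/\epsilon)^{O(k \log(1/\epsilon))}
  && \text{(cover size from Theorem 1.4)} \\
\log M &\le \log n + O\bigl(k \log^2(1/\epsilon)\bigr)
  && \text{(taking logarithms)} \\
\frac{\log M}{\epsilon^2} &= O\!\left(\frac{1}{\epsilon^2} \cdot \log n
  + \frac{k}{\epsilon^2} \cdot \log^2(1/\epsilon)\right)
  && \text{(dividing by } \epsilon^2\text{)}
\end{align*}
```

The first term carries the dependence on n, and the second exceeds the optimal $(k/\epsilon^2)\sqrt{\log(1/\epsilon)}$ by a factor of $\log^{3/2}(1/\epsilon)$.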

Previous work on learning k-SIIRVs [DDS12b, DDO+13] relies on a certain “regularity” lemma
about the structure of these distributions: Any k-SIIRV is either $\epsilon$-close in total variation distance
to being $L = O(k^9/\epsilon^4)$-“sparse”, i.e., it is supported on a set of at most L consecutive integers, or
$\epsilon$-close to being “Gaussian like”. In the former case, the distribution can be learned using $O(L/\epsilon^2)$
samples, and in the latter case one can exploit the Gaussian structure to learn with a small number
of samples as well. Unfortunately, the sparse case is a bottleneck for this approach, as any algorithm
to learn a distribution over support L requires $\Omega(L/\epsilon^2)$ samples. Hence, one needs to exploit the
structure of k-SIIRVs beyond the aforementioned.

3 We remark that the running time of this method is $\Omega(M/\epsilon^2)$, which is not necessarily polynomial in the sample
size.

Our Learning Approach. In this paper, we depart from the aforementioned approaches. We
identify a simple condition - the approximate sparsity of the Fourier transform - as the “right”
property that determines the sample complexity of our learning problem. The Fourier sparsity
explains why the sample complexity of learning k-SIIRVs is independent of n, and allows us to
obtain the sharp sample bound as a function of both k and $\epsilon$. We show that this is a more general
phenomenon (see Theorem 2.5 in Section 2.2): any univariate distribution that has an s-sparse
Fourier transform, in a certain well-defined technical sense, is learnable with $O(s/\epsilon^2)$ samples.

Our computationally efficient learning algorithm proceeds as follows: It starts by drawing an
initial set of samples to determine the effective support of the target distribution and its Fourier
transform. This is achieved by estimating the mean and variance of our SIIRV. We remark that,
for computational purposes, our algorithm uses the Discrete Fourier Transform (DFT). For the
appropriate definition of the DFT, we show (Lemma 2.3) there exists an explicit set S of cardinality
$|S| = O(k^2 \log(k/\epsilon))$ that contains all the “heavy” Fourier coefficients (footnote 4). Our algorithm then draws
an additional set of samples to estimate the DFT of the target distribution at the points of the
effective support S, and sets the DFT to 0 everywhere else. By exploiting the sparsity in the
Fourier domain, we show that the inverse of the empirical DFT achieves total variation distance
$\epsilon/2$ after $\widetilde{O}(k/\epsilon^2)$ samples. Note that an explicit description of an accurate hypothesis for our
learning problem can have an effective support of size $\Omega(k\sqrt{n})$. While we can easily obtain such a
description (by explicitly computing the inverse DFT), this would not lead to a computationally
efficient algorithm. We instead output a succinct description of our hypothesis (in time that is
independent of n). In particular, our algorithm outputs the empirical DFT at the points of its
effective support. Our learning algorithm is given in Section 2.1.
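
The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's Learn-SIIRV algorithm: the explicit set S of Lemma 2.3 is not stated in this section, so the code substitutes an assumed stand-in (frequencies $\xi$ with $\xi/M$ close to a rational of denominator below k), and the constants `width_const` and `window_const`, the single-batch use of samples, and all function names are hypothetical choices.

```python
import cmath
import math

def learn_sparse_dft(samples, k, width_const=2.0, window_const=4.0):
    """Return (M, m, coeffs): a truncated empirical DFT modulo M on [m, m+M-1].

    width_const and window_const are illustrative, not values from the paper.
    """
    N = len(samples)
    mean = sum(samples) / N
    var = sum((x - mean) ** 2 for x in samples) / N
    sigma = math.sqrt(var) + 1.0                  # padded so widths stay finite
    M = 1 + 2 * math.ceil(window_const * sigma)   # modulus covering the effective support
    m = math.floor(mean) - M // 2                 # left end of the window
    width = width_const / sigma
    # Assumed stand-in for the explicit set S: keep xi when xi/M is within
    # `width` of some rational a/b with 0 <= a <= b < k.
    rationals = {a / b for b in range(1, k) for a in range(b + 1)}
    S = [xi for xi in range(M) if any(abs(xi / M - r) <= width for r in rationals)]
    # Empirical DFT on S; every other coefficient is implicitly set to 0.
    coeffs = {xi: sum(cmath.exp(-2j * cmath.pi * xi * x / M) for x in samples) / N
              for xi in S}
    return M, m, coeffs

def evaluate(M, m, coeffs, j):
    """Inverse DFT of the truncated transform at integer j (0 outside the window)."""
    if not m <= j <= m + M - 1:
        return 0.0
    total = sum(c * cmath.exp(2j * cmath.pi * xi * j / M) for xi, c in coeffs.items())
    return (total / M).real
```

The returned dictionary plays the role of the succinct description: its size depends on the number of retained frequencies rather than on n, and `evaluate` touches each retained coefficient once.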

We emphasize that the implicit description of the hypothesis H, via its DFT $\widehat{H}$, is sufficient to
obtain both an approximate evaluation oracle and an approximate sampler for the target k-SIIRV
P. Obtaining an approximate evaluation oracle is straightforward: Since $\widehat{H}$ is supported on the set
S, we can compute $H(i)$ in time $O(|S|)$. To obtain an efficient sampler, we proceed in two steps:
We first show how to efficiently compute the CDF of H, using oracle access to the DFT $\widehat{H}$. To
do this, we express the value of the CDF at any point via a closed form expression involving the
values of $\widehat{H}$. Given oracle access to the CDF, we use a simple binary search procedure to sample
from a distribution Q satisfying $d_{TV}(Q, H) \le \epsilon/2$. Our sampler is given in Section 2.3.
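
The binary-search step can be sketched as follows. The closed-form CDF expression is not reproduced in this section, so `cdf` below is a hypothetical callable standing in for it; the sketch assumes `cdf` is nondecreasing on the integer range `[lo, hi]` and reaches (approximately) 1 at `hi`.

```python
import random
from itertools import accumulate

def sample_via_cdf(cdf, lo, hi, rng):
    """Return the smallest integer j in [lo, hi] with cdf(j) >= u, for uniform u."""
    u = rng.random()
    while lo < hi:
        mid = (lo + hi) // 2
        if cdf(mid) >= u:
            hi = mid          # the answer is at mid or to its left
        else:
            lo = mid + 1      # the answer is strictly to the right
    return lo

# Toy CDF oracle backed by a table; a real oracle would be computed from the DFT.
pmf = [0.2, 0.5, 0.3]
cdf_table = list(accumulate(pmf))

def cdf(j):
    return cdf_table[j]

draw = sample_via_cdf(cdf, 0, len(pmf) - 1, random.Random(0))
```

Each sample costs a number of CDF evaluations logarithmic in the window length, which is what makes oracle access to the CDF sufficient for efficient sampling.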

Finally, we note that our above-described Fourier-learning algorithm achieves a near-optimal
sample complexity (up to logarithmic factors). The basic idea to obtain the optimal sample
complexity is to smoothly mollify the DFT instead of truncating it. This removes some artifacts caused
by a sharp truncation and yields a hypothesis whose error from the true distribution decays rapidly
as we move away from the mean. Our sample-optimal upper bound is established in Section 2.4.

Cover Upper Bound. We start by commenting on previous approaches for proving cover upper
bounds in this context. The main technique for the 2-SIIRV cover upper bound of [DP09a] is the
following lemma (that is deduced in [DP09a] using a result from [Roo00]): If two 2-SIIRVs agree on
their first $O(\log(1/\epsilon))$ moments, then their total variation distance is at most $\epsilon$. First, we show that
this moment-matching lemma is quantitatively tight: we give an example of two 2-SIIRVs over k+1
variables that agree on the first k moments and have variation distance (Proposition B.1).

We emphasize however that such a moment-matching technique cannot be generalized to
k-SIIRVs, even for k = 3. Intuitively, this is because knowledge about moments fails to account for
potential periodic structure of the probability mass function that comes into play for k > 2. For
example, $O(n)$ moments do not suffice to distinguish between the cases that a 3-SIIRV of order n is
supported on the even versus the odd integers. More specifically, in Proposition B.2 (Appendix B),
we give an explicit example of two 3-SIIRVs of order n/2 that agree exactly on the first n - 1
moments and have disjoint supports.

4 We moreover show that there exists a set of cardinality $O(k \log(k/\epsilon))$ that contains all the “heavy” Fourier
coefficients, alas this smaller set is not explicitly known a priori.

In conclusion, moment-based approaches fail to detect periodic structure. On the other hand,
this type of structure is easily detectable by considering the Fourier transform. Our cover upper
bound hinges on showing that the Fourier transform of a k-SIIRV is necessarily of low complexity,
i.e., it can be succinctly described up to small error. In particular, since the Fourier transform
is smooth, we show (Lemma 3.6), roughly, that its logarithm can be well approximated by a low
degree Taylor polynomial on intervals of length $O(1/k)$. (Our actual statement is somewhat more
complicated as it needs to account for roots of the Fourier transform close to the unit circle.)
Therefore, providing approximations to the low-degree Taylor coefficients of the logarithm of the
Fourier transform provides a concise approximate description of the distribution.
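
To see how the Fourier transform exposes periodic structure, consider the single frequency 1/2, where the transform equals $E[(-1)^X] = \Pr[X \text{ even}] - \Pr[X \text{ odd}]$. The sketch below uses two simple 3-SIIRVs of our own choosing (they are not the moment-matching construction of Proposition B.2): one supported on even integers, one on odd integers.

```python
def convolve_all(pmfs):
    """Exact pmf of a sum of independent integer variables given their pmfs."""
    dist = [1.0]
    for p in pmfs:
        new = [0.0] * (len(dist) + len(p) - 1)
        for i, a in enumerate(dist):
            for j, b in enumerate(p):
                new[i + j] += a * b
        dist = new
    return dist

def parity_coefficient(pmf):
    """Fourier transform at frequency 1/2: P(even) - P(odd)."""
    return sum(p if j % 2 == 0 else -p for j, p in enumerate(pmf))

# Five 3-IRVs uniform on {0, 2}: the sum is always even.
even_supported = convolve_all([[0.5, 0.0, 0.5]] * 5)
# Four such variables plus one 3-IRV fixed at 1: the sum is always odd.
odd_supported = convolve_all([[0.5, 0.0, 0.5]] * 4 + [[0.0, 1.0, 0.0]])
```

The coefficient is +1 for the first distribution and -1 for the second, so one Fourier coefficient separates the two supports.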

Cover Lower Bound & Sample Lower Bound. Our lower bounds take a geometric view of
the problem. At a high-level, we consider the function that maps the set of n(k - 1) parameters
defining a k-SIIRV to the corresponding probability mass function. We show that there exists
a region of the space of distributions where this function is locally invertible. For k = 2, we in
fact show that the distribution of any 2-SIIRV with distinct parameters lies in the interior of this
region. This structural understanding allows us to use certain appropriately defined expectations to
extract the effect of individual parameters on the distribution. In addition, for $n = \Theta(\log(1/\epsilon))$, we
show that near a particular k-SIIRV not only is the map from parameters to distribution locally a
bijection, but that this map is actually surjective onto a ball of reasonable size. In other words, near
this particular distribution, the $\Omega(k \log(1/\epsilon))$ parameters of the output distribution are effectively
independent, which intuitively implies the $(1/\epsilon)^{\Omega(k \log(1/\epsilon))}$ lower bound on the cover size.

To prove our sample lower bound, at a high-level, we combine the aforementioned geometric
understanding with Assouad’s lemma [Ass83]. We note that one might naively expect that such
a situation would lead to a lower bound of $\Omega(k \log(1/\epsilon)/\epsilon^2)$, but since the distributions under
consideration have additional structure, it turns out that the best lower bound that can be obtained
is $\Omega(k\sqrt{\log(1/\epsilon)}/\epsilon^2)$.


1.5 Related Work

Density estimation is a classical topic in statistics and machine learning with
a rich history and extensive literature (see, e.g., [BBBB72, DG85, Sil86, Sco92, DL01]). The reader
is referred to [Ize91] for a survey of statistical techniques in this context. In recent years, a large
body of work in TCS has been studying these questions from a computational perspective; see,
e.g., [KMR+94, FM99, AK01, CGG02, VW02, FOS05, BS10, KMV10, MV10, DDS12a, DDS12b,
DDO+13, CDSS14a, CDSS14b, ADLS15].


Covering numbers (and their logarithms, known as metric entropy numbers) were first defined
by A. N. Kolmogorov in the 1950’s and have since played a central role in a number of areas,
including approximation theory, geometric functional analysis (see, e.g., [Dud74, Mak86, BGL07] and
the books [KT59, Lor66, CS90, ET96]), geometric approximation algorithms [HP11], information
theory, statistics, and machine learning (see, e.g., [Yat85, Bir86, HI90, HO97, YB99, GS13] and the
books [vdVW96, DL01, Tsy08]).


Concurrent Work. Concurrent work by Daskalakis et al. [DKT15], using different techniques,
gives upper bounds on the learning sample complexity of Poisson Multinomial Distributions (PMDs).
While upper bounds on the sample complexity of PMDs yield similar upper bounds for k-SIIRVs,
the implied upper bounds for k-SIIRVs are quantitatively significantly weaker than ours. Moreover,
the [DKT15] learning algorithm has running time exponential in k and super-polynomial in $1/\epsilon$.

Subsequent Work. In a followup work [DKS15a], the authors have generalized the techniques of
this paper to the multidimensional case, namely to the family of Poisson Multinomial Distributions
(PMDs), i.e., sums of independent random vectors supported over the standard basis in $\mathbb{R}^k$. We note

that the results of the current paper are not subsumed by the results of [DKS15a]. In particular,
[DKS15a] gives an efficient learning algorithm for PMDs that uses $\log^{O(k)}(1/\epsilon)/\epsilon^2$ samples, and
proves that the optimal cover size for PMDs depends doubly exponentially on k.

1.6 Organization

In Section 2, we describe and analyze our learning algorithms for k-SIIRVs.
Section 3 contains our cover upper bound construction. Our cover lower bound is given in Section 4,
and our sample lower bound in Section 5.

2 Learning SIIRVs 

In this section, we describe our algorithms for learning k-SIIRVs. The structure of this section is
as follows: In Section 2.1, we give our sample near-optimal and computationally efficient learning
algorithm. As mentioned in the introduction, our algorithm outputs a succinct description of its
hypothesis H, via its DFT. In Section 2.2, we provide a simple general algorithm that learns any
one-dimensional discrete distribution with a sparse Fourier support. In Section 2.3, we show how
to efficiently obtain an $\epsilon$-sampler for our unknown k-SIIRV, using the DFT representation of H
as a black-box. Finally, in Section 2.4, we present our more sophisticated Fourier-based learning
algorithm with optimal sample complexity.

2.1 A Computationally Efficient Sample Near-Optimal Algorithm The main result of
this subsection is Theorem 2.1 below, which we state in detail for the sake of completeness.

Theorem 2.1. There is an algorithm Learn-SIIRV that for any $P \in \mathcal{S}_{n,k}$ and $\epsilon > 0$, takes
$O(k \log^2(k/\epsilon)/\epsilon^2)$ samples from $P$, runs in time $\tilde{O}(k^3/\epsilon^2)$ and returns a (succinct description of a)
hypothesis $H$ so that with probability at least $2/3$ we have that $d_{\mathrm{TV}}(P, H) \le \epsilon$.

For computational purposes, our learning algorithm in this section uses the Discrete Fourier 
Transform, which we now define. 

Definition 2.2. For $x \in \mathbb{R}$ we will denote $e(x) \stackrel{\mathrm{def}}{=} \exp(-2\pi i x)$. The Discrete Fourier Transform
(DFT) modulo $M$ of a function $F : [n] \to \mathbb{C}$ is the function $\widehat{F} : [M-1] \to \mathbb{C}$ defined as
\[ \widehat{F}(\xi) = \sum_{j=0}^{n} e(\xi j/M)\, F(j), \]
for integers $\xi \in [M-1]$. The DFT modulo $M$ of a distribution $P$, $\widehat{P}$, is the
DFT modulo $M$ of its probability mass function. The inverse DFT modulo $M$ onto the range
$[m, m+M-1]$ of $\widehat{F} : [M-1] \to \mathbb{C}$ is the function $F : [m, m+M-1] \cap \mathbb{Z} \to \mathbb{C}$ defined by
\[ F(j) = \frac{1}{M} \sum_{\xi=0}^{M-1} e(-\xi j/M)\, \widehat{F}(\xi), \]
for $j \in [m, m+M-1] \cap \mathbb{Z}$. The $L_2$ norm of the DFT is defined as
\[ \|\widehat{F}\|_2 = \sqrt{\frac{1}{M} \sum_{\xi=0}^{M-1} |\widehat{F}(\xi)|^2}. \]
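To make Definition 2.2 concrete, the following is a minimal Python sketch of the DFT modulo $M$, its inverse onto a range $[m, m+M-1]$, and the normalized $L_2$ norm. The function names and the toy distribution are illustrative assumptions and do not come from the paper; the sums are evaluated naively rather than with a fast transform.

```python
import cmath

def e(x):
    """e(x) = exp(-2*pi*i*x), the convention of Definition 2.2."""
    return cmath.exp(-2j * cmath.pi * x)

def dft_mod(F, M):
    """DFT modulo M of a function given as a dict {j: F(j)}; list indexed by xi."""
    return [sum(e(xi * j / M) * v for j, v in F.items()) for xi in range(M)]

def inverse_dft_mod(F_hat, m):
    """Inverse DFT modulo M = len(F_hat) onto the integer range [m, m + M - 1]."""
    M = len(F_hat)
    return {j: sum(e(-xi * j / M) * F_hat[xi] for xi in range(M)) / M
            for j in range(m, m + M)}

def dft_l2_norm(F_hat):
    """L2 norm of the DFT, with the 1/M normalization of Definition 2.2."""
    M = len(F_hat)
    return (sum(abs(v) ** 2 for v in F_hat) / M) ** 0.5

# Hypothetical toy distribution supported on {3, 4, 5}.
P = {3: 0.2, 4: 0.5, 5: 0.3}
P_hat = dft_mod(P, M=5)
P_back = inverse_dft_mod(P_hat, m=2)  # recover P on the range [2, 6]
```

Because the support of the toy distribution fits inside one period of length $M$, the inverse transform recovers it exactly (up to floating-point error), and the normalized $L_2$ norm of the DFT equals the $L_2$ norm of the probability mass function.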

We start by giving an intuitive explanation of our approach. The Fourier transform $\widehat{Q}$ of the
empirical distribution $Q$ provides an approximation to the Fourier transform $\widehat{P}$ of $P$. In particular,
if we take $N$ samples from $P$, we expect that the empirical Fourier transform $\widehat{Q}$ has error $O(N^{-1/2})$
at each point. This implies that the expected $L_2$ error $\|\widehat{Q} - \widehat{P}\|_2$ is $O(N^{-1/2})$, and thus applying
the inverse Fourier transform would yield a distribution with $L_2$ error of $O(N^{-1/2})$ from $P$. This
guarantee may sound good, but unfortunately, the distribution $P$ has effective support of size
approximately $s\sqrt{\log(1/\epsilon)}$, where $s = \sqrt{\mathrm{Var}_{X \sim P}[X]}$, and thus the resulting distribution will likely
have $L_1$ error of $O(N^{-1/2} s^{1/2} \log^{1/4}(1/\epsilon))$ from $P$. This bound is prohibitively large, especially
when the standard deviation of $P$ is large.











This obstacle can be circumvented by relying on a new structural result that we believe may be
of independent interest. We show that for any $k$-SIIRV with large variance, its Fourier transform
will have small effective support. In particular, for any $k$-SIIRV with standard deviation $s$ and
$\epsilon > 0$ we consider its Discrete Fourier transform modulo $M$, and show that the set of points in $[M-1]$
whose Fourier transform is bigger than $\epsilon$ in magnitude has size at most $O(M k s^{-1} \sqrt{\log(1/\epsilon)})$. By
choosing $M$ to be approximately $s\sqrt{\log(1/\epsilon)}$, i.e., of the same order as the effective support of $P$,
we conclude that the effective support of $\widehat{P}$ (modulo $M$) is $O(k \log(1/\epsilon))$.

If the effective support of $\widehat{P}$ were explicitly known, we could truncate our empirical Discrete
Fourier transform $\widehat{Q}$ (modulo $M$) outside this set and reduce the $L_2$ error $\|\widehat{Q} - \widehat{P}\|_2$ to
$N^{-1/2} k^{1/2} s^{-1/2} \log^{1/4}(1/\epsilon)$. This in turn would correspond to an $L_1$ error of $O(N^{-1/2} k^{1/2} \sqrt{\log(1/\epsilon)})$.
Unfortunately, we do not know exactly where the support of the Fourier transform is, so we will
need to approximate it by calculating the empirical DFT where the support might be, and then
simply truncating this empirical DFT wherever it is sufficiently small. Fortunately, we do have
some idea of where the support is, and it is not hard to show that we can truncate at all of the
appropriate points with high probability.

Algorithm Learn-SIIRV 

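Input: sample access to a $k$-SIIRV $P$ and $\epsilon > 0$.

Let $C$ be a sufficiently large universal constant.

1. Draw $O(1)$ samples from $P$ and with confidence probability $19/20$ compute: (a) $\tilde\sigma^2$, a factor
2 approximation to $\mathrm{Var}_{X \sim P}[X] + 1$, and (b) $\tilde\mu$, an approximation to $\mathbb{E}_{X \sim P}[X]$ to within one
standard deviation.

2. Take $N = C^3 (k/\epsilon^2) \ln^2(k/\epsilon)$ samples from $P$ to get an empirical distribution $Q$.

3. If $\tilde\sigma \le 4k \ln(4/\epsilon)$, then output $Q$. Otherwise, proceed to the next step.

4. Set $M \stackrel{\mathrm{def}}{=} 1 + 2\lceil 6\tilde\sigma\sqrt{\ln(4/\epsilon)} \rceil$. Let
\[ S \stackrel{\mathrm{def}}{=} \{ \xi \in [M-1] \mid \exists a, b \in \mathbb{Z},\ 0 \le a \le b < k \text{ such that } |\xi/M - a/b| \le O(\log(k/\epsilon)/M) \}. \]
For each $\xi \in S$, compute the DFT modulo $M$ of $Q$ at $\xi$, $\widehat{Q}(\xi)$.

5. Compute $\widehat{H}$, which is defined as $\widehat{H}(\xi) = \widehat{Q}(\xi)$ if $\xi \in S$ and $|\widehat{Q}(\xi)| \ge R := 2C^{-1}\epsilon/\sqrt{k \ln(k/\epsilon)}$,
and $\widehat{H}(\xi) = 0$ otherwise.

6. Output $\widehat{H}$, which is a succinct representation of $H$, the inverse DFT of $\widehat{H}$ modulo $M$ onto
the range $[\lfloor\tilde\mu\rfloor - (M-1)/2, \lfloor\tilde\mu\rfloor + (M-1)/2]$.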
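The steps of Learn-SIIRV can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the constant `C`, the explicit radius used to build the set $S$ (taken from Lemma 2.3 (i) with $\delta = T = R/2$ and $s \ge \tilde\sigma/2$), and the choice to estimate $\tilde\mu$ and $\tilde\sigma$ from the same batch of samples are all hypothetical simplifications, and the toy sample size is chosen by hand rather than by the formula in Step 2.

```python
import cmath
import math
import random
from collections import Counter

def e(x):
    return cmath.exp(-2j * cmath.pi * x)

def learn_siirv(samples, k, eps, C=2.0):
    """Sketch of Learn-SIIRV; returns ("empirical", Q) or ("dft", M, lo, H_hat)."""
    N = len(samples)
    Q = {x: c / N for x, c in Counter(samples).items()}
    # Step 1 (simplifying assumption): estimate mean and sigma from the same samples.
    mu = sum(x * q for x, q in Q.items())
    sigma = math.sqrt(sum((x - mu) ** 2 * q for x, q in Q.items()) + 1.0)
    # Step 3: for small variance, the empirical distribution is output directly.
    if sigma <= 4 * k * math.log(4 / eps):
        return ("empirical", Q)
    # Step 4: modulus M and the candidate frequency set S around fractions a/b.
    M = 1 + 2 * math.ceil(6 * sigma * math.sqrt(math.log(4 / eps)))
    R = 2 * eps / (C * math.sqrt(k * math.log(k / eps)))
    # Assumed explicit radius: M * sqrt(ln(1/T)) / sigma with T = R/2.
    radius = M * math.sqrt(math.log(2 / R)) / sigma
    S = set()
    for b in range(1, k):
        for a in range(b + 1):
            center = M * a / b
            for xi in range(math.ceil(center - radius), math.floor(center + radius) + 1):
                if 0 <= xi <= M - 1:
                    S.add(xi)
    # Step 5: keep only the empirical coefficients of magnitude at least R.
    H_hat = {}
    for xi in S:
        q_hat = sum(e(xi * x / M) * q for x, q in Q.items())
        if abs(q_hat) >= R:
            H_hat[xi] = q_hat
    # Step 6: succinct output; H lives on the integer range [lo, lo + M - 1].
    lo = math.floor(mu) - (M - 1) // 2
    return ("dft", M, lo, H_hat)

def evaluate(M, H_hat, x):
    """Evaluate H(x) via the inverse DFT restricted to the stored coefficients."""
    return (sum(e(-xi * x / M) * v for xi, v in H_hat.items()) / M).real

# Toy experiment: a 2-SIIRV (Binomial(10000, 1/2)) sampled with a fixed seed.
n, k, eps = 10000, 2, 0.1
pmf = [math.exp(math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                - n * math.log(2)) for j in range(n + 1)]
rng = random.Random(0)
samples = rng.choices(range(n + 1), weights=pmf, k=14400)
kind, M, lo, H_hat = learn_siirv(samples, k, eps)
l1_err = sum(abs(evaluate(M, H_hat, x) - pmf[x]) for x in range(lo, lo + M))
```

In this toy run only a handful of DFT coefficients near $\xi = 0$ (and their mirror images near $M$) survive the threshold, so the hypothesis is stored with far fewer numbers than the roughly $M$ values of its probability mass function.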


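The bulk of our analysis will depend on showing that the Fourier transform of $P$ has appropriately
small effective support. To do this we need the following lemma:

Lemma 2.3. Let $P \in \mathcal{S}_{n,k}$ with $\sqrt{\mathrm{Var}_{X \sim P}[X]} = s$, $1/2 > \delta > 0$, and $M \in \mathbb{Z}_+$ with $M > s$. Let $\widehat{P}$
be the discrete Fourier transform of $P$ modulo $M$. Then, we have

(i) Let $\mathcal{L} = \mathcal{L}(\delta, M, s) \stackrel{\mathrm{def}}{=} \{ \xi \in [M-1] \mid \exists a, b \in \mathbb{Z},\ 0 \le a \le b < k \text{ such that } |\xi/M - a/b| < \sqrt{\ln(1/\delta)}/(2s) \}$.
Then, $|\widehat{P}(\xi)| \le \delta$ for all $\xi \in [M-1] \setminus \mathcal{L}$. That is, $|\widehat{P}(\xi)| > \delta$ for at most $|\mathcal{L}| \le M k^2 s^{-1} \sqrt{\log(1/\delta)}$
values of $\xi$.

(ii) At most $4 M k s^{-1} \sqrt{\log(1/\delta)}$ many integers $0 \le \xi \le M-1$ have $|\widehat{P}(\xi)| > \delta$.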

Before we proceed with the proof of the lemma some comments are in order. Statement (i)
of the lemma exhibits an explicit set $\mathcal{L}$ of cardinality $O(M k^2 s^{-1} \sqrt{\log(1/\delta)})$ that contains all the
points $\xi \in [M-1]$ such that $|\widehat{P}(\xi)| > \delta$. Note that the set $\mathcal{L}$ can be efficiently computed from $M$,
$\delta$, $s$, and does not otherwise depend on the particular $k$-SIIRV $P$. Statement (ii) of the lemma
shows that the effective support $\mathcal{L}' = \mathcal{L}'(\delta) = \{ \xi \in [M-1] \mid |\widehat{P}(\xi)| > \delta \}$ is in fact significantly
smaller than $\mathcal{L}$, namely $|\mathcal{L}'| = O(M k s^{-1} \sqrt{\log(1/\delta)})$. This part of the lemma is non-constructive
in the sense that it does not provide an explicit description for $\mathcal{L}'$ (beyond the fact that $\mathcal{L}' \subseteq \mathcal{L}$).
The upper bound on the size of the effective support is the basis for the analysis of our algorithm.

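Proof of Lemma 2.3. Since $P \in \mathcal{S}_{n,k}$, for $X \sim P$ we have $X = \sum_{i=1}^{n} X_i$, where each $X_i \sim P_i$ for a
$k$-IRV $P_i$. Let $Y_i = X_i - X_i'$ be the difference of two independent copies of $X_i$. Let $p_{ij} = \Pr[|Y_i| = j]$.
Note that $Y_i$ is a symmetric random variable. Consider its DFT modulo $M$, which we will write as
$\widehat{Y}_i$. We have the following sequence of (in)equalities:
\[
\begin{aligned}
|\widehat{P}_i(\xi)|^2 = \widehat{P}_i(\xi)\widehat{P}_i(-\xi) = \widehat{Y}_i(\xi)
&= \sum_{j=0}^{k-1} p_{ij} \cos(2\pi \xi j/M) = 1 - \sum_{j=1}^{k-1} p_{ij}\bigl(1 - \cos(2\pi \xi j/M)\bigr) \\
&\le 1 - 8 \sum_{j=1}^{k-1} p_{ij} [\xi j/M]^2 \le \exp\Bigl(-8 \sum_{j=1}^{k-1} p_{ij} [\xi j/M]^2\Bigr),
\end{aligned}
\]
where $[x]$, $x \in \mathbb{R}$, denotes the distance between $x$ and its nearest integer. For the last two
inequalities, we used that $\cos 2\pi x \le 1 - 8x^2$ when $|x| \le 1/2$, and $e^{-x} \ge 1 - x$ when $x \ge 0$.

Therefore, we have that $|\widehat{P}(\xi)|^2 = \prod_{i=1}^{n} |\widehat{P}_i(\xi)|^2 \le \exp\bigl(-8 \sum_{i=1}^{n} \sum_{j=1}^{k-1} p_{ij} [\xi j/M]^2\bigr)$. Taking
square roots, we obtain
\[ |\widehat{P}(\xi)| \le \exp\Bigl(-4 \sum_{i=1}^{n} \sum_{j=1}^{k-1} p_{ij} [\xi j/M]^2\Bigr). \tag{1} \]

Note that we can relate the variance of $P$ to the $p_{ij}$'s as follows:
\[ s^2 = \mathrm{Var}[X] = \sum_{i=1}^{n} \mathrm{Var}[X_i] = \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}[Y_i^2] = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k-1} p_{ij}\, j^2. \tag{2} \]

Substituting (2) into (1), and writing $[\xi j/M]^2 = j^2 \cdot ([\xi j/M]/j)^2$, we get
\[ |\widehat{P}(\xi)| \le \exp\Bigl(-8 s^2 \min_{1 \le j \le k-1} \bigl([\xi j/M]/j\bigr)^2\Bigr). \]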

To complete the proof of (i), we will need a simple counting argument given in the following claim:

Claim 2.4. For $a \in \mathbb{R}_+$, $j \in \mathbb{Z}_+$, there are at most $2Maj + j$ integers $0 \le \xi \le M-1$ with the
following property: there exists $c \in \mathbb{Z}$ with $0 \le c \le j$ such that $|\xi/M - c/j| < a$. Therefore, there
are at most $2Ma + j$ integers $0 \le \xi \le M-1$ with $[\xi j/M] < a$.

Proof. For each $c$ satisfying $1 \le c \le j-1$ there are either $\lfloor 2Ma \rfloor$ or $\lfloor 2Ma \rfloor + 1$ integers $0 \le \xi \le M-1$
with $|\xi/M - c/j| < a$. For $c = 0$ and $c = j$ there are either $\lfloor Ma \rfloor$ or $\lfloor Ma \rfloor + 1$ integers with $|\xi/M - c/j| < a$.

Finally, note that $|\xi/M - c/j| < a$ for some $0 \le c \le j$ if and only if $[j\xi/M] < aj$. □
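The second statement of Claim 2.4 is easy to check by brute force. The sketch below counts, for a few hypothetical parameter choices, the integers $0 \le \xi \le M-1$ with $[\xi j/M] < a$ and records the bound $2Ma + j$ next to each count. The helper names and parameter grid are illustrative assumptions.

```python
def dist_to_nearest_int(x):
    """[x]: the distance between x and its nearest integer."""
    return abs(x - round(x))

def count_close(M, j, a):
    """Number of integers 0 <= xi <= M - 1 with [xi * j / M] < a."""
    return sum(1 for xi in range(M) if dist_to_nearest_int(xi * j / M) < a)

# Each entry is (M, j, a, count, bound) with bound = 2*M*a + j from Claim 2.4.
checks = [(M, j, a, count_close(M, j, a), 2 * M * a + j)
          for M in (101, 257, 1009)
          for j in (1, 2, 3, 7)
          for a in (0.013, 0.05, 0.21)]
```

For instance, with $M = 101$, $j = 3$ and $a = 0.05$ the qualifying integers cluster around $0$, $M/3$, $2M/3$ and $M$, which is exactly the structure the claim counts.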

An application of the above claim for $a = (1/2s)\sqrt{\ln(1/\delta)}$ implies that there are at most
\[ \sum_{j=1}^{k-1} \Bigl( 2Mj s^{-1} \sqrt{\ln(1/\delta)}/2 + j \Bigr) \le M k^2 s^{-1} \sqrt{\ln(1/\delta)} + k^2 \le 2 M k^2 s^{-1} \sqrt{\ln(1/\delta)} \]
integers $0 \le \xi \le M-1$ with $\min_j ([\xi j/M]/j)^2 < \ln(1/\delta)/(4s^2)$. For all other integers we have
$|\widehat{P}(\xi)| \le \delta$, which completes the proof of (i).

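To prove (ii) we proceed by the probabilistic method as follows: Consider evaluating the RHS
of (1) with $\xi$ being an integer random variable uniformly distributed in $[M-1]$. For $1 \le j \le k-1$,
let $N_j$ be the indicator random variable for the event that $[\xi j/M] < k s^{-1} \sqrt{\ln(1/\delta)}/2$. Observe
that by Claim 2.4 it follows that $\mathbb{E}[N_j] \le 2 k s^{-1} \sqrt{\ln(1/\delta)}$.

Note that $[\xi j/M] \ge (1 - N_j) \cdot k s^{-1} \sqrt{\ln(1/\delta)}/2$. Plugging this into (1) gives
\[ |\widehat{P}(\xi)| \le \exp\Bigl( -\frac{k^2}{s^2} \ln(1/\delta) \sum_{i=1}^{n} \sum_{j=1}^{k-1} p_{ij} (1 - N_j) \Bigr). \]

Since $s^2 = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k-1} p_{ij}\, j^2 \le \frac{k^2}{2} \sum_{i=1}^{n} \sum_{j=1}^{k-1} p_{ij}$, it follows that $\theta := \sum_{i=1}^{n} \sum_{j=1}^{k-1} p_{ij} \ge 2s^2/k^2$.
Therefore,
\[ \mathbb{E}\Bigl[ \sum_{i=1}^{n} \sum_{j=1}^{k-1} p_{ij} N_j \Bigr] \le \theta \cdot 2 k s^{-1} \sqrt{\ln(1/\delta)}. \]

By Markov's inequality, except with probability $4 k s^{-1} \sqrt{\ln(1/\delta)}$, we have that $\sum_{i=1}^{n} \sum_{j=1}^{k-1} p_{ij} N_j \le \theta/2$.
In this event, we have $\sum_{i=1}^{n} \sum_{j=1}^{k-1} p_{ij} (1 - N_j) \ge \theta/2$ and hence
\[ |\widehat{P}(\xi)| \le \exp\Bigl( -\frac{k^2}{s^2} \ln(1/\delta) \sum_{i=1}^{n} \sum_{j=1}^{k-1} p_{ij} (1 - N_j) \Bigr) \le \exp\Bigl( -\frac{k^2}{s^2} \ln(1/\delta) \cdot \frac{\theta}{2} \Bigr) \le \delta. \]

Since $\xi$ is uniformly distributed on $[M-1]$, it follows that $|\widehat{P}(\xi)| > \delta$ for at most $4 M k s^{-1} \sqrt{\ln(1/\delta)}$
integers $\xi$ in $[M-1]$. This completes the proof of (ii). □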
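Lemma 2.3 can be sanity-checked numerically on a small example. The sketch below computes $|\widehat{P}(\xi)|$ exactly for a hypothetical 3-SIIRV (2500 independent copies of a variable uniform on $\{0, 2\}$, so $s = 50$), lists the frequencies with $|\widehat{P}(\xi)| > \delta$, and compares them against both parts of the lemma. The instance, the modulus $M = 1001$ and the helper names are assumptions chosen for illustration.

```python
import cmath
import math

def e(x):
    return cmath.exp(-2j * cmath.pi * x)

def abs_dft_of_sum(components, M):
    """|P_hat(xi)| for a sum of independent variables; components = [(pmf, copies), ...]."""
    mags = []
    for xi in range(M):
        mag = 1.0
        for pmf, copies in components:
            # The DFT of a sum of independent variables is the product of the DFTs.
            mag *= abs(sum(p * e(xi * j / M) for j, p in pmf.items())) ** copies
        mags.append(mag)
    return mags

def variance(pmf):
    mean = sum(j * p for j, p in pmf.items())
    return sum((j - mean) ** 2 * p for j, p in pmf.items())

# Hypothetical 3-SIIRV: 2500 independent copies of a variable uniform on {0, 2}.
k, M, delta = 3, 1001, 0.01
components = [({0: 0.5, 2: 0.5}, 2500)]
s = math.sqrt(sum(copies * variance(pmf) for pmf, copies in components))
mags = abs_dft_of_sum(components, M)
large = [xi for xi in range(M) if mags[xi] > delta]

# Part (ii): cardinality bound on the effective support.
bound_ii = 4 * M * k * math.sqrt(math.log(1 / delta)) / s
# Part (i): every large coefficient is close to some a/b with 0 <= a <= b < k (b >= 1).
radius = math.sqrt(math.log(1 / delta)) / (2 * s)
covered = all(any(abs(xi / M - a / b) < radius
                  for b in range(1, k) for a in range(b + 1))
              for xi in large)
```

Because every summand is supported on even integers, the large coefficients appear near $\xi/M \approx 0$ and also near $\xi/M \approx 1/2$, matching the fractions $a/b$ with $b < k$ that define $\mathcal{L}$.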

We are now ready to prove Theorem 2.1.

Proof of Theorem 2.1. Note that it is straightforward to verify the sample complexity bound. The
running time of the algorithm is dominated by computing the DFT $\widehat{Q}$. Since the support of $Q$ has size at
most $N$, for each $\xi \in S$, we sum at most $N$ terms to calculate $\widehat{Q}(\xi)$. Therefore, the overall running
time is $O(N \cdot |S|) = O(k \log^2(k/\epsilon)/\epsilon^2 \cdot k^2 \log(k/\epsilon)) = O(k^3 \log^3(k/\epsilon)/\epsilon^2)$ as claimed.

To show correctness, we will prove that the expected squared $L_2$ distance between $\widehat{H}$ and $\widehat{P}$ is
small, i.e., that $\|\widehat{H} - \widehat{P}\|_2^2 = (1/M) \cdot \sum_{\xi \in [M-1]} |\widehat{H}(\xi) - \widehat{P}(\xi)|^2$ has small expected value.

It is easy to see that, after drawing a constant number of samples, the quantities $\tilde\mu$ and $\tilde\sigma$ can
be estimated to satisfy the required conditions with probability at least $19/20$. (This follows for
example by Lemma 6 of [DDS12b] with $\epsilon = 1/2$.) We will henceforth condition on this event.

If $\tilde\sigma \le 4k \ln(4/\epsilon)$, then $s \le 2k \ln(4/\epsilon) + 1$, and Bernstein's inequality implies that $X \sim P$ is
within $O(k \log(1/\epsilon))$ of the mean with probability $1 - \epsilon/2$. In this case, $O(k \log(1/\epsilon)/\epsilon^2) \le N$
samples are sufficient to give that $d_{\mathrm{TV}}(P, Q) \le \epsilon$ with probability $2/3$. (This follows from the fact
that any distribution over a support of size $L$ can be learned with $O(L/\epsilon^2)$ samples to total variation
distance $\epsilon$.) We henceforth assume that we have $|\mu - \tilde\mu| \le s$, $s \ge \tilde\sigma/2 > 2k \ln(4/\epsilon)$ and $\tilde\sigma \le 2s$.
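
Since $M = 1 + 2\lceil 6\tilde\sigma\sqrt{\ln(4/\epsilon)} \rceil$, a random variable $X \sim P$ lies in $[\lfloor\tilde\mu\rfloor - (M-1)/2, \lfloor\tilde\mu\rfloor + (M-1)/2]$
with probability at least $1 - \epsilon/2$. Indeed, an application of Bernstein's inequality for $X$ yields
that
\[ \Pr(X > \mu + t) \le \exp\Bigl( -\frac{t^2}{2s^2 + \frac{2}{3} k t} \Bigr), \]
where $\mu$ is the mean of $P$, for any $t > 0$. For $t = 2s\sqrt{\ln(4/\epsilon)}$, we have $t^2 = 4s^2 \ln(4/\epsilon)$ and
$2s^2 + \frac{2}{3} k t = 2s^2 + \frac{4}{3} k s \sqrt{\ln(4/\epsilon)} \le \frac{8}{3} s^2 \le 4s^2$, where we used $s > 2k\ln(4/\epsilon)$. Thus, $\Pr(X > \mu + t) \le \epsilon/4$. Similarly, it holds that
$\Pr(X < \mu - t) \le \epsilon/4$. Now note that $\lfloor\tilde\mu\rfloor + (M-1)/2 \ge (\mu - s) + \lceil 3s\sqrt{\ln(4/\epsilon)} \rceil \ge \mu + t$ and
$\lfloor\tilde\mu\rfloor - (M-1)/2 \le \mu - t$. Hence, $X$ is in $[\lfloor\tilde\mu\rfloor - (M-1)/2, \lfloor\tilde\mu\rfloor + (M-1)/2]$ with probability at
least $1 - \epsilon/2$ as desired.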

Fix $T := R/2 = C^{-1}\epsilon/\sqrt{k \ln(k/\epsilon)}$. We analyze separately the contribution to the squared $L_2$
norm coming from $\xi$ with $|\widehat{P}(\xi)| > T$ and with $|\widehat{P}(\xi)| \le T$. Let us denote $\mathcal{L}'(T) = \{ \xi \in [M-1] \mid |\widehat{P}(\xi)| > T \}$. First consider
\[ (1/M) \cdot \sum_{\xi \notin \mathcal{L}'(T)} |\widehat{H}(\xi) - \widehat{P}(\xi)|^2. \]

We first claim that with high probability $\widehat{H}(\xi) = 0$ for all $\xi \notin \mathcal{L}'(T)$. This happens automatically
when $\xi \notin S$, where $S$ is defined in the algorithm description. Note that $|S| = O(k^2 \log(k/\epsilon))$.
For $\xi \in S$ we note that $\widehat{Q}(\xi)$ is an average of $N$ i.i.d. numbers, each of absolute value 1
and mean $\widehat{P}(\xi)$ (which has absolute value at most 1). Note that if $|\widehat{Q}(\xi) - \widehat{P}(\xi)| \ge R - T$, then
either the real or the imaginary part is at least $(R-T)/\sqrt{2}$ in absolute value. By a Chernoff bound, the probability
that for a given $\xi \in S \setminus \mathcal{L}'(T)$, $|\Re(\widehat{Q}(\xi) - \widehat{P}(\xi))| \ge (R-T)/\sqrt{2}$ is at most $2\exp(-N(R-T)^2/4)$.
The same is true of the imaginary part, so by a union bound the probability that $|\widehat{Q}(\xi) - \widehat{P}(\xi)| \ge R - T$
is at most $4\exp(-N(R-T)^2/4)$. Again by a union bound, we get that the probability that
any $\xi \in S \setminus \mathcal{L}'(T)$ has $|\widehat{Q}(\xi) - \widehat{P}(\xi)| \ge R - T$ is at most $O(k^2 \log(k/\epsilon) \exp(-N(R-T)^2/4)) = O(k^2 \log(k/\epsilon) \exp(-C \ln(k/\epsilon))) = O(\epsilon^{C-1})$. Hence, except with probability $O(\epsilon^{C-1})$, for all $\xi$ in
$S \setminus \mathcal{L}'(T)$ we have $|\widehat{Q}(\xi) - \widehat{P}(\xi)| < R - T$ and so $|\widehat{Q}(\xi)| < R$. In fact, the total expected contribution
to the squared $L_2$ norm coming from cases when $\widehat{H}(\xi)$ is not identically 0 on all such $\xi$ is also
$O(\epsilon^{C-1})$. Therefore, up to negligible error, the squared $L_2$ error coming from this range is at most
\[ (1/M) \cdot \sum_{\xi \notin \mathcal{L}'(T)} |\widehat{P}(\xi)|^2 \le \sum_{r \ge 0} (T 2^{-r})^2 \cdot \frac{|\{ \xi : |\widehat{P}(\xi)| > T 2^{-r-1} \}|}{M}. \]

Applying Lemma 2.3 (ii) with $\delta := T 2^{-r-1}$ for each $r \ge 0$, this is at most
\[ \sum_{r \ge 0} T^2 4^{-r} \cdot 4 k s^{-1} \sqrt{\log(2^{r+1}/T)} \le 8 T^2 k s^{-1} \sqrt{\log(1/T)}. \]


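We now consider the remaining contribution
\[ (1/M) \cdot \sum_{\xi \in \mathcal{L}'(T)} |\widehat{H}(\xi) - \widehat{P}(\xi)|^2. \]

By Lemma 2.3 (i) applied with $\delta := T$, it follows that $\mathcal{L}'(T) \subseteq \mathcal{L}(T, M, s)$. Since $\sqrt{\ln(1/T)}/(2s) = O(\log(k/\epsilon)/M)$, we can choose the constant in the definition of $S$ so that $\mathcal{L}(T, M, s) \subseteq S$. So, for
$\xi \in \mathcal{L}'(T)$, we do compute $\widehat{Q}(\xi)$ and then either $\widehat{H}(\xi) = \widehat{Q}(\xi)$ or $|\widehat{Q}(\xi)| < R$ and $\widehat{H}(\xi) = 0$. In
either case, we have that $|\widehat{H}(\xi) - \widehat{Q}(\xi)| \le R$. Recall that the expected size of $|\widehat{Q}(\xi) - \widehat{P}(\xi)|^2$ is at most $1/N$
for any $\xi \in [M-1]$. So, for $\xi \in \mathcal{L}'(T)$, the expected squared error at $\xi$ satisfies $\mathbb{E}[|\widehat{H}(\xi) - \widehat{P}(\xi)|^2] \le 2(R^2 + N^{-1})$.

By Lemma 2.3 (ii) applied with $\delta := T$, we have $|\mathcal{L}'(T)| \le 4 M k s^{-1} \sqrt{\ln(1/T)}$. So, the expected
squared $L_2$ error on $\mathcal{L}'(T)$ satisfies
\[ \mathbb{E}\Bigl[ (1/M) \cdot \sum_{\xi \in \mathcal{L}'(T)} |\widehat{H}(\xi) - \widehat{P}(\xi)|^2 \Bigr] \le 4 (R^2 + N^{-1}) \bigl( 2 k s^{-1} \sqrt{\ln(1/T)} \bigr). \]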

ZeC'(T) 

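Combining the above results, we find that the expected squared $L_2$ error between $\widehat{H}$ and $\widehat{P}$ is at most
\[ 4 (R^2 + N^{-1} + T^2) \bigl( 2 k s^{-1} \sqrt{\log(1/T)} \bigr) = O\bigl( C^{-1} s^{-1} \epsilon^2 / \sqrt{\log(k/\epsilon)} \bigr). \]

Therefore, if $C$ is sufficiently large, Markov's inequality yields that, with high constant probability, we have
$\|\widehat{H} - \widehat{P}\|_2^2 \le \epsilon^2/M$.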

At this point, we would like to use Plancherel's theorem followed by Cauchy-Schwarz to complete
the proof. Formally, since $P$ may be supported outside $[\lfloor\tilde\mu\rfloor - (M-1)/2, \lfloor\tilde\mu\rfloor + (M-1)/2]$, we
cannot use Plancherel's theorem directly to show that $\|\widehat{H} - \widehat{P}\|_2 = \|H - P\|_2$. Instead, consider the
function $P' : [\lfloor\tilde\mu\rfloor - (M-1)/2, \lfloor\tilde\mu\rfloor + (M-1)/2] \cap \mathbb{Z} \to [0,1]$ defined as $P'(i) = \sum_{j \equiv i \pmod{M}} P(j)$
for $\lfloor\tilde\mu\rfloor - (M-1)/2 \le i \le \lfloor\tilde\mu\rfloor + (M-1)/2$. Note that $\widehat{P'} = \widehat{P}$ by the definition of the DFT
modulo $M$, since $e(\xi j/M) = e(\xi i/M)$ when $j \equiv i \pmod{M}$ for all $\xi \in [M-1]$ and $i, j \in [n]$. Thus,
$\|\widehat{H} - \widehat{P'}\|_2^2 \le \epsilon^2/M$ and Plancherel's theorem gives $\|H - P'\|_2 = \|\widehat{H} - \widehat{P'}\|_2 \le \epsilon/\sqrt{M}$. Since $P'$ has
support of size at most $M$, an application of Cauchy-Schwarz gives $\|H - P'\|_1 \le \|H - P'\|_2 \sqrt{M} \le \epsilon$.

Since $X \sim P$ is in $[\lfloor\tilde\mu\rfloor - (M-1)/2, \lfloor\tilde\mu\rfloor + (M-1)/2]$ with probability at least $1 - \epsilon/2$, we
have $\|P - P'\|_1 \le \epsilon$ and so $\|P - H\|_1 \le \|P - P'\|_1 + \|H - P'\|_1 \le 2\epsilon$. Since $\widehat{H}(0) = \widehat{Q}(0) = 1$, it
follows that $\sum_i H(i) = 1$. Also, by symmetry, all the $H(i)$'s are real. This completes the proof
of Theorem 2.1. □


2.2 A General Fourier Learning Algorithm The algorithmic approach of the previous subsection
is not specialized to $k$-SIIRVs, but is applicable more generally. In essence, the approach
really only depended upon two facts:

• $P$ is effectively supported on a small set $T$.

• $\widehat{P}$ is effectively supported on a small set $S$.


It turns out that by using similar ideas, we can learn any probability distribution with these properties.
The following simple theorem provides a generalization for integer-valued random variables.
However, the approach can also be generalized to higher dimensions and to continuous distributions.

Theorem 2.5. Let $P$ be an integer-valued random variable and $\epsilon > 0$. Let $T \subset \mathbb{Z}$ and $S \subset \mathbb{R}/\mathbb{Z}$ be
known subsets so that:
\[ \sum_{n \in \mathbb{Z} \setminus T} P(n) < \epsilon/3, \qquad \text{and} \qquad \int_{\xi \in (\mathbb{R}/\mathbb{Z}) \setminus S} |\widehat{P}(\xi)|^2 \, d\xi < \epsilon^2/(9|T|). \]
Then, there exists an algorithm which learns $P$ to total variational distance $\epsilon$ using $N = O(|T| \mu(S)/\epsilon^2)$
samples, where $\mu(S)$ is the Lebesgue measure of $S$.

The generic algorithm is as follows:


Algorithm Learn-Sparse-FT

Input: sample access to a distribution $P$ over $[n]$ and $\epsilon > 0$.

Let $C$ be a sufficiently large universal constant.

1. Take $N = C|T|\mu(S)/\epsilon^2$ samples from $P$ to get an empirical distribution $Q$.

2. Compute $\widehat{H}$, which is defined as $\widehat{H}(\xi) = \widehat{Q}(\xi)$ if $\xi \in S$, and $\widehat{H}(\xi) = 0$ otherwise.

3. Output $H$, where $H$ is the inverse Fourier transform of $\widehat{H}$ restricted to $T$. In particular,
$H(n) = \int_{\xi \in S} e(-n\xi) \widehat{H}(\xi) \, d\xi$ for $n \in T$, and $H(n) = 0$ for $n \notin T$.


Note that this is exactly the form of the algorithm for learning $k$-SIIRVs, except that the latter
algorithm must also learn $T$ (which is done by computing an approximate mean and variance) and
$S$ (which is obtained through a thresholding procedure). Also, note that we use the continuous
Fourier transform here rather than a discrete Fourier transform. This is mostly for conceptual
convenience. In practice, the continuous Fourier transform can be replaced by a sufficiently fine
discrete Fourier transform, yielding an algorithm in which the integrals can be replaced by finite
sums.

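The analysis of the algorithm is not difficult. We begin by bounding the expected $L_2$
difference between $\widehat{P}$ and $\widehat{H}$. In particular, we note that
\[
\begin{aligned}
\int_{\xi \in \mathbb{R}/\mathbb{Z}} |\widehat{P}(\xi) - \widehat{H}(\xi)|^2 \, d\xi
&= \int_{\xi \in S} |\widehat{P}(\xi) - \widehat{H}(\xi)|^2 \, d\xi + \int_{\xi \in (\mathbb{R}/\mathbb{Z}) \setminus S} |\widehat{P}(\xi)|^2 \, d\xi \\
&\le \int_{\xi \in S} |\widehat{P}(\xi) - \widehat{H}(\xi)|^2 \, d\xi + \epsilon^2/(9|T|).
\end{aligned}
\]

Now, for any given value of $\xi \in S$, we note that $\widehat{P}(\xi) - \widehat{H}(\xi)$ has mean 0 and variance at most $1/N$.
Therefore, we have that
\[ \mathbb{E}\Bigl[ \int_{\xi \in S} |\widehat{P}(\xi) - \widehat{H}(\xi)|^2 \, d\xi \Bigr] \le \mu(S)/N = \epsilon^2/(C|T|). \]

For $C$ large enough, by the Markov inequality, this is at most $\epsilon^2/(9|T|)$ with probability at least
$2/3$. If this holds, then
\[ \int_{\xi \in \mathbb{R}/\mathbb{Z}} |\widehat{P}(\xi) - \widehat{H}(\xi)|^2 \, d\xi \le \epsilon^2/(4|T|). \]

By Plancherel's Theorem, this would imply that the squared $L_2$ distance between $P$ and the inverse
Fourier transform of $\widehat{H}$ is at most $\epsilon^2/(4|T|)$. Along with Cauchy-Schwarz, this implies that
\[ \sum_{n \in T} |P(n) - H(n)| \le \sqrt{|T|} \cdot \sqrt{\epsilon^2/(4|T|)} = \epsilon/2. \]

On the other hand,
\[ \sum_{n \in \mathbb{Z} \setminus T} |P(n) - H(n)| = \sum_{n \in \mathbb{Z} \setminus T} P(n) < \epsilon/3. \]

Therefore, $d_{\mathrm{TV}}(P, H) \le \epsilon$.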

2.3 An Efficient Sampler for our Hypothesis The learning algorithm of Section 2.1 outputs
a succinct description of the hypothesis pseudo-distribution $H$, via its DFT. This immediately
provides us with an efficient evaluation oracle for $H$, i.e., an $\epsilon$-evaluation oracle for our target
SIIRV $P$. The running time of this oracle is linear in the size of $S$, the effective support of the DFT.

Note that we can explicitly output the hypothesis $H$ by computing the inverse DFT at all the
points of the support of $H$. However, in contrast to the effective support of $\widehat{H}$, the support of $H$
can be large, and this explicit description would not lead to a computationally efficient algorithm.
In this subsection, we show how to efficiently obtain an $\epsilon$-sampler for our unknown $k$-SIIRV $P$,
using the DFT representation of $H$ as a black-box. In particular, starting with an
accurate hypothesis $H$, represented via its DFT, we show how to efficiently obtain an $\epsilon$-sampler
for the unknown target distribution. We remark that the efficient procedure of this section is not
restricted to $k$-SIIRVs, but is more general, applying to all univariate discrete distributions for
which an efficient oracle for the DFT is available.

In particular, we prove the following theorem: 

Theorem 2.6. Let $M \in \mathbb{Z}_+$, and $a, b \in \mathbb{Z}$ with $b - a = M - 1$. Let $H : [a, b] \to \mathbb{R}$ be a pseudo-distribution
succinctly represented via its DFT (modulo $M$), $\widehat{H}$, which is supported on a set $S$, i.e.,
$H(x) = (1/M) \cdot \sum_{\xi \in S} e(-\xi x/M) \widehat{H}(\xi)$, for $x \in [a, b]$, with $0 \in S$ and $\widehat{H}(0) = 1$. Suppose that there
exists a distribution $P$ with $d_{\mathrm{TV}}(H, P) \le \epsilon/3$. Then, there exists an $\epsilon$-sampler for $P$, i.e., a sampler
for a distribution $Q$ such that $d_{\mathrm{TV}}(P, Q) \le \epsilon$, running in time $O(\log(M) \log(M/\epsilon) \cdot |S|)$.

Combining the above with Theorem 2.1 we get:

Corollary 2.7. For all $n, k \in \mathbb{Z}_+$ and $\epsilon > 0$, there is an algorithm with the following performance
guarantee: Let $P \in \mathcal{S}_{n,k}$ be an unknown $k$-SIIRV. The algorithm uses $O(k \log^2(k/\epsilon)/\epsilon^2)$ samples
from $P$, runs in time $\tilde{O}(k^3/\epsilon^2) \cdot \log n$, and with probability at least $9/10$ outputs an $\epsilon$-sampler for
$P$. This $\epsilon$-sampler produces a single sample in time $O(k \log^2(kn) \log^2(k/\epsilon))$.

Proof. For the output of algorithm Learn-SIIRV, $M = O((1 + \tilde\sigma)\sqrt{\log(1/\epsilon)}) = O(kn)$ and $|S| \le |\mathcal{L}'(T)| \le 2 M k s^{-1} \sqrt{\ln(1/T)} = O(k \log(k/\epsilon))$. □

Note that we can effectively reduce the $k$-SIIRV learning problem to the case of $n = \mathrm{poly}(k/\epsilon)$.
We can use this fact as a simple bootstrapping step to eliminate the logarithmic dependence on $n$ in
the runtime of the above described sampler. The details are deferred to Appendix C.1.

This section is devoted to the proof of Theorem 2.6. We start by providing some high-level
intuition. Roughly speaking, we obtain the desired sampler by using the Cumulative Distribution Function
(CDF) corresponding to $H$. We use the DFT to obtain a closed form expression for the CDF of $H$,
and then we query the CDF using an appropriate binary search procedure to sample from the
distribution. One subtle point is that $H(x)$ is a pseudo-distribution, i.e., it is not necessarily non-negative
at all points. Our analysis shows that this does not pose any problems with correctness.

Our first lemma shows that it is sufficient to have an efficient oracle for the CDF: 

Lemma 2.8. Given a pseudo-distribution $H$ supported on $[a, b] \cap \mathbb{Z}$, $a, b \in \mathbb{Z}$, with CDF $c_H(x) = \sum_{i : a \le i \le x} H(i)$
(which satisfies $c_H(b) = 1$), and oracle access to a function $c(x)$ so that $|c(x) - c_H(x)| \le \epsilon/(10(b - a + 1))$
for all $x$, we have the following: If there is a distribution $P$ with
$d_{\mathrm{TV}}(H, P) \le \epsilon/3$, there is a sampler for a distribution $Q$ with $d_{\mathrm{TV}}(P, Q) \le \epsilon$, using $O(\log(b + 1 - a) + \log(1/\epsilon))$
uniform random bits as input, running in time $O((D + 1)(\log(b + 1 - a)) + \log(1/\epsilon))$,
where $D$ is the running time of evaluating the CDF $c(x)$.

Proof. We begin our analysis by producing an algorithm that works when we are able to exactly
compute $c_H(x)$.

We can compute an inverse to the CDF, $d_H : [0,1] \to [a, b] \cap \mathbb{Z}$, at $y \in [0,1]$, using binary search,
as follows:

1. We have an interval $[a', b']$, initially $[a - 1, b]$, with $c_H(a') \le y \le c_H(b')$ and $c_H(a') < c_H(b')$.

2. If $b' - a' = 1$, output $d_H(y) = b'$.

3. Otherwise, find the midpoint $c' = \lfloor (a' + b')/2 \rfloor$.

4. If $c_H(a') < c_H(c')$ and $y \le c_H(c')$, repeat with $[a', c']$; else repeat with $[c', b']$.

The function $d_H$ can be thought of as some kind of inverse to the CDF $c_H : [a - 1, b] \cap \mathbb{Z} \to [0,1]$
in the following sense:

Claim 2.9. The function $d_H$ satisfies: For any $y \in [0,1]$, it holds that $c_H(d_H(y) - 1) \le y \le c_H(d_H(y))$
and $c_H(d_H(y) - 1) < c_H(d_H(y))$.

Proof. Note that if we don't have $c_H(a') < c_H(c')$ and $y \le c_H(c')$, then $c_H(c') \le y \le c_H(b')$ and $c_H(c') < c_H(b')$. So,
Step 4 gives an interval $[a', b']$ which satisfies $c_H(a') \le y \le c_H(b')$ and $c_H(a') < c_H(b')$. The initial
interval $[a - 1, b]$ satisfies these conditions since $c_H(a - 1) = 0$ and $c_H(b) = 1$. By induction, all
$[a', b']$ in the execution of the above algorithm have $c_H(a') \le y \le c_H(b')$ and $c_H(a') < c_H(b')$. Since
this is impossible if $a' = b'$, and Step 4 always recurses on a shorter interval, we eventually have
$b' - a' = 1$. Then, the conditions $c_H(a') \le y \le c_H(b')$ and $c_H(a') < c_H(b')$ give the claim. □
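The binary search of Steps 1-4 can be sketched in Python as follows. The pseudo-distribution used here is a hypothetical toy example with one negative entry, and exact rational arithmetic is used (an assumption made for the illustration) so that the invariants of Claim 2.9 can be checked without floating-point noise.

```python
from collections import Counter
from fractions import Fraction

def inverse_cdf(cdf, a, b, y):
    """Binary search of Steps 1-4: returns d_H(y) in [a, b]."""
    lo, hi = a - 1, b  # invariant: cdf(lo) <= y <= cdf(hi) and cdf(lo) < cdf(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cdf(lo) < cdf(mid) and y <= cdf(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical pseudo-distribution on [0, 3] with one negative entry; it sums to 1.
H = [Fraction(3, 10), Fraction(-1, 20), Fraction(9, 20), Fraction(3, 10)]
a, b = 0, 3

def cdf(x):
    """c_H(x) = sum of H(i) over a <= i <= x, with c_H(a - 1) = 0."""
    return sum(H[: x - a + 1], Fraction(0))

# Evaluate d_H on an evenly spaced grid of y values in (0, 1).
ys = [Fraction(2 * i + 1, 2000) for i in range(1000)]
outputs = [inverse_cdf(cdf, a, b, y) for y in ys]
counts = Counter(outputs)
```

On this example the point $x = 1$, where $H(1) < 0$, is never returned, and the output frequencies at the other points are at most the corresponding values of $H$, in line with the analysis of the distribution of $d_H(Y)$ for uniform $Y$.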

Computing $d_H(y)$ requires $O(\log(b - a + 1))$ evaluations of $c_H$, and $O(\log(b - a + 1))$ comparisons
with $y$. For the rest of this proof, we will use $n = b - a + 1$ to denote the support size.

Consider the random variable $d_H(Y)$, for $Y$ uniformly distributed in $[0,1]$, whose distribution
we will call $Q$. When $d_H(Y) = x$, we have $c_H(x - 1) \le Y \le c_H(x)$, and so when $Q(x) > 0$, we
have $Q(x) \le \Pr[c_H(x - 1) \le Y \le c_H(x)] = c_H(x) - c_H(x - 1) = H(x)$. So, when $H(x) > 0$, we
have $H(x) \ge Q(x)$. But when $H(x) \le 0$, we have $Q(x) = 0$, since then $c_H(x) \le c_H(x - 1)$ and the
strict inequality $c_H(x - 1) < c_H(x)$ required by Claim 2.9 fails. So, we have $d_{\mathrm{TV}}(Q, H) = \sum_{x : H(x) < 0} -H(x) \le d_{\mathrm{TV}}(H, P) \le \epsilon/3$.

We now show how to effectively sample from $Q$. The issue is how to simulate a sample from
the uniform distribution on $[0,1]$ with uniform random bits. We do this by flipping coins for the
bits of $Y$ lazily. We note that we will only need to know more than $m$ bits of $Y$ if $Y$ is within $2^{-m}$
of one of the values of $c_H(x)$ for some $x$. By a union bound, this happens with probability at most
$n 2^{-m}$ over the choice of $Y$. Therefore, for $m > \log_2(10n/\epsilon)$, the probability that this will happen
is at most $\epsilon/10$ and can be ignored.

Therefore, the random variable $d_H(Y')$, for $Y'$ uniformly distributed on the multiples of $2^{-r}$ in
$[0,1)$ for $r = O(\log n + \log(1/\epsilon))$, has a distribution $Q'$ that satisfies $d_{\mathrm{TV}}(Q, Q') \le \epsilon/10$. Therefore,
$d_{\mathrm{TV}}(P, Q') \le d_{\mathrm{TV}}(P, H) + d_{\mathrm{TV}}(H, Q) + d_{\mathrm{TV}}(Q, Q') \le 9\epsilon/10$. This is an $\epsilon$-sampler that uses
$O(\log n + \log(1/\epsilon))$ coin flips, $O(\log n)$ calls to $c_H(x)$, and has the desired running time.

We now need to show how this can be simulated without access to $c_H$ and instead only having
access to its approximation $c(x)$. The modification required is rather straightforward. Essentially,
we can run the same algorithm using $c(x)$ in place of $c_H(x)$. Observe that all comparisons with $Y$
will produce the same result, unless the chosen $Y$ is between $c(x)$ and $c_H(x)$ for some value of $x$.
We note that because of our bounds on their difference, the probability of this occurring for any
given value of $x$ is at most $\epsilon/(10n)$. By a union bound, the probability of it occurring for any $x$ is
at most $\epsilon/10$. Thus, with probability at least $1 - \epsilon/10$ our algorithm returns the same result that
it would have returned had it had access to $c_H(x)$ instead of $c(x)$. This implies that the variable sampled
by this algorithm has variation distance at most $\epsilon/10$ from what would have been sampled by our
other algorithm. Therefore, this algorithm samples from a distribution within total variation distance $\epsilon$ of $P$. □


We next show that we can efficiently compute an appropriate CDF, using the DFT. 

Proposition 2.10. For $H$ as in Theorem 2.6, there is an algorithm to compute the CDF $c_H : [a, b] \cap \mathbb{Z} \to [0,1]$
with $c_H(x) = \sum_{i : a \le i \le x} H(i)$ to any precision $\delta > 0$, where $b - a = M - 1$,
$M \in \mathbb{Z}_+$. The algorithm runs in time $O(|S| \log(1/\delta))$.

Proof. Recall that the PMF of $H$ at $x \in [a, b] \cap \mathbb{Z}$ is given by the inverse DFT:
\[ H(x) = \frac{1}{M} \sum_{\xi \in S} e(-\xi x/M) \widehat{H}(\xi). \tag{3} \]

The CDF is given by:
\[ c_H(x) = \sum_{i : a \le i \le x} H(i) = \frac{1}{M} \sum_{\xi \in S} \widehat{H}(\xi) \sum_{i : a \le i \le x} e(-\xi i/M). \]

When $\xi \ne 0$, the term $\sum_{i : a \le i \le x} e(-\xi i/M)$ is a geometric series. By standard results on its sum,
we have:
\[ \sum_{i : a \le i \le x} e(-\xi i/M) = \frac{e(-\xi a/M) - e(-\xi (x+1)/M)}{1 - e(-\xi/M)}. \]

When $\xi = 0$, $e(-\xi i/M) = 1$, and we get $\sum_{i : a \le i \le x} e(-\xi i/M) = x + 1 - a$. In this case, we also have
$\widehat{H}(0) = 1$. Putting this together we have:
\[ c_H(x) = \frac{1}{M} \Bigl( x + 1 - a + \sum_{\xi \in S \setminus \{0\}} \widehat{H}(\xi) \, \frac{e(-\xi a/M) - e(-\xi (x+1)/M)}{1 - e(-\xi/M)} \Bigr). \tag{4} \]

Hence, we obtain a closed form expression for the CDF that can be approximated to desired
precision in time $O(|S| \log(1/\delta))$. □
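The closed form (4) can be sketched in a few lines of Python. The representation of $\widehat{H}$ as a dictionary, the function name, and the toy pseudo-distribution are illustrative assumptions; the code evaluates the formula in floating point rather than to a prescribed precision $\delta$.

```python
import cmath

def e(x):
    return cmath.exp(-2j * cmath.pi * x)

def cdf_from_dft(H_hat, M, a, x):
    """Equation (4): CDF of H at x from its DFT coefficients {xi: H_hat(xi)}.

    Assumes H_hat(0) = 1 and that all other keys lie in {1, ..., M - 1}.
    """
    total = x + 1 - a  # contribution of xi = 0
    for xi, v in H_hat.items():
        if xi != 0:
            # Geometric series sum of e(-xi * i / M) over a <= i <= x.
            total += v * (e(-xi * a / M) - e(-xi * (x + 1) / M)) / (1 - e(-xi / M))
    return (total / M).real

# Hypothetical toy example on [a, a + M - 1] with the full DFT available.
a, M = 10, 7
H = [0.1, 0.2, 0.05, 0.25, 0.15, 0.15, 0.1]
H_hat = {xi: sum(e(xi * (a + i) / M) * h for i, h in enumerate(H)) for xi in range(M)}
prefix = [sum(H[: i + 1]) for i in range(M)]
cdf_values = [cdf_from_dft(H_hat, M, a, a + i) for i in range(M)]

# A sparse (truncated) DFT still gives c_H(b) = 1, since H_hat(0) = 1.
sparse = {xi: H_hat[xi] for xi in (0, 1, M - 1)}
```

Each evaluation touches every stored coefficient once, so its cost is linear in the number of nonzero DFT coefficients, independent of the length $M$ of the range.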


Now we can prove the main theorem of this subsection. 

Proof of Theorem 2.6. By Proposition 2.10, we can efficiently calculate the CDF of $H$. So, we
can apply Lemma 2.8 to this CDF. This gives us an $\epsilon$-sampler for $P$. To find the time it takes to
compute each sample, we need to substitute $D = O(|S| \log(M/\epsilon))$ from the running time of the
CDF into the bound in Lemma 2.8, yielding $O(\log M \cdot \log(M/\epsilon) \cdot |S|)$ time. This completes the
proof. □


2.4 Sample-Optimal Learning Algorithm In this subsection, we show how to improve the
sample complexity of our learning algorithm for $k$-SIIRVs given in Section 2.1 and obtain an
algorithm with optimal sample complexity (up to constant factors). The basic idea behind the
improvement is as follows: In our previous analysis, we made critical use of the fact that essentially
all of the mass of the distribution in question lies in an explicit interval of length $\Theta(s\sqrt{\log(1/\epsilon)})$,
where $s$ is the standard deviation. By using our Fourier learning approach, we were able to learn
a distribution that approximated our target on this support. In order to improve this algorithm,
we observe that although it is necessary to move $\Omega(\sqrt{\log(1/\epsilon)})$ standard deviations from the mean
before the tail of the cumulative distribution function (CDF) drops below $\epsilon$, the tail has already begun to drop
off exponentially after only a single standard deviation from the mean.

Unfortunately, applying a sharp threshold to our Fourier transform (as in Section 2.1) can lead
to effects that fall off relatively slowly with distance. Note that such a sharp thresholding in the
Fourier domain is equivalent to convolution with a sinc function, which has tails proportional to
$1/|x|$. In order to correct this issue, we will instead perform our thresholding by multiplying by
a function with smooth cutoffs. This smooth thresholding step corresponds to convolving with a
function of width approximately $s$ with Gaussian tails. We remark that this step has the critical
effect of causing our expected errors to be much smaller at points further from the mean, since
most of our samples (within a few standard deviations of the mean) will have little effect on our
output for these points. A careful analysis of the expected error at each point will yield our final
bound.

We will warm up in Section 2.4.1, where we describe our algorithm in the case of 2-SIIRVs.
This will exhibit the important new ideas of this technique. Then, in Section 2.4.2, we extend these
results to $k$-SIIRVs, which brings with it several technical complications, mostly arising from the
fact that we do not know a priori a good effective support for the Fourier transform.


2.4.1 Sample Optimal Learning Algorithm for 2-SIIRVs In this subsection, we will prove
the following theorem:

Theorem 2.11. There exists an algorithm that given $N = O(\sqrt{\log(1/\epsilon)}/\epsilon^2)$ independent samples
from a 2-SIIRV $X$, runs in time $O(N)$ and with probability at least $2/3$ outputs a hypothesis distribution
$Y$ that is within $\epsilon$ of $X$ in total variational distance.


Our new algorithm Learn-2-SIIRV-Optimal-Sample is described in pseudocode below. We
first provide an equivalent alternative interpretation of our algorithm in terms of truncating the
Fourier transform. As in our algorithm Learn-SIIRV of Section 2.1, we start by obtaining
approximations $\tilde\sigma^2$ and $\tilde\mu$ for the variance and mean. Similarly, we output the empirical distribution if
$\tilde\sigma \le \Theta(\sqrt{\ln(1/\epsilon)})$. This allows us to assume that $\tilde\sigma = \Omega(\sqrt{\ln(1/\epsilon)})$. (Note that this bound is not as
strong as that in Learn-SIIRV because in the current setting we aim to use fewer samples.)

Our new learning algorithm proceeds by computing the empirical Fourier transform of $X$ and
truncating it in a judiciously chosen way. Let $G(\xi)$, $\xi \in \mathbb{R}$, be a Gaussian of standard deviation
$1/\tilde\sigma$ taken modulo 1. More specifically, let
\[ G(\xi) = \sum_{n \in \mathbb{Z}} \frac{\tilde\sigma}{\sqrt{2\pi}} e^{-\tilde\sigma^2 (n + \xi)^2/2}. \]

Let $I(\xi)$ be the indicator function of the interval $[-C\tilde\sigma^{-1}\sqrt{\log(1/\epsilon)}, C\tilde\sigma^{-1}\sqrt{\log(1/\epsilon)}]$, for $C$ a
sufficiently large constant. Let $\widehat{F}$ be the convolution of $I$ and $G$, i.e., $\widehat{F} = I * G$. We note that
multiplication by $\widehat{F}$ is an appropriate method of thresholding. In particular, we start by showing
that $\widehat{F}$ approximates $I$ in the following way:

Claim 2.12. (i) $\widehat{F}(\xi) \in [0,1]$ for all $\xi$.

(ii) $\widehat{F}(\xi) \ge 1 - \epsilon^2$ for $|\xi| \le (C - 3)\tilde\sigma^{-1}\sqrt{\log(1/\epsilon)}$.

(iii) $\widehat{F}(\xi) \le \epsilon^2$ for $\frac{1}{2} \ge |\xi| \ge (C + 3)\tilde\sigma^{-1}\sqrt{\log(1/\epsilon)}$.

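Proof. Note that $\hat{F}$ is the convolution of $I$ and $G$. We can write:

\[ \hat{F}(\xi) = \int_{\xi - C\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}}^{\xi + C\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}} G(\nu)\, d\nu \;\le\; \int_{-1/2}^{1/2} G(\nu)\, d\nu \;=\; \int_{-\infty}^{\infty} \frac{\tilde{\sigma}}{\sqrt{2\pi}}\, e^{-\tilde{\sigma}^2 \xi^2/2}\, d\xi = 1 . \tag{5} \]

Clearly, this convolution is positive at all points. Thus, we get (i).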

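When $|\xi| \le (C-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}$, note that the integral in (5) is over

\[ \nu \in [\xi - C\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)},\ \xi + C\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}] \supseteq [-3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)},\ 3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}] . \]

By standard tail bounds, the Gaussian $\frac{\tilde{\sigma}}{\sqrt{2\pi}}\, e^{-\tilde{\sigma}^2 \nu^2/2}$ has at least $1 - \epsilon^2$ of its mass in the interval
$[-3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)},\ 3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}]$, and so $\hat{F}(\xi) \ge 1 - \epsilon^2$. This gives us (ii).

When $\frac{1}{2} \ge |\xi| \ge (C+3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}$, the integral in (5) is over
$\nu \in [\xi - C\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)},\ \xi + C\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}]$, which is disjoint from the interval
$[-3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)},\ 3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}]$. By the
same bound, the Gaussian has at most $\epsilon^2$ of its mass outside $[-3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)},\ 3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}]$.

So, we deduce (iii). □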
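The three properties of Claim 2.12 can be sanity-checked numerically. The following is a minimal sketch, not part of the original analysis: the function names `normal_cdf` and `f_hat` and the parameter values used in the checks are illustrative assumptions. It evaluates the convolution of the indicator with the wrapped Gaussian through the normal CDF, summing over a few integer shifts to account for the reduction modulo 1.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def f_hat(xi, sigma, eps, C, wraps=2):
    """Convolution of the indicator of [-a, a] with the wrapped Gaussian G.

    Here a = C * sqrt(log(1/eps)) / sigma and G has standard deviation
    1/sigma taken modulo 1, so we add the Gaussian mass of the shifted
    interval [xi + n - a, xi + n + a] for a few integers n.
    `wraps` (hypothetical parameter) controls how many shifts are summed.
    """
    a = C * math.sqrt(math.log(1.0 / eps)) / sigma
    total = 0.0
    for n in range(-wraps, wraps + 1):
        lo = sigma * (xi + n - a)
        hi = sigma * (xi + n + a)
        total += normal_cdf(hi) - normal_cdf(lo)
    return total
```

With, say, `sigma = 50`, `eps = 0.1` and `C = 5`, the value at the inner threshold $(C-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}$ should be at least $1-\epsilon^2$, and the value at the outer threshold $(C+3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}$ should be at most $\epsilon^2$.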

At a high level, our new algorithm involves the following steps:

1. Let $Z$ be the empirical distribution and $\hat{Z}$ be the (continuous) Fourier transform of $Z$.

2. Let $\hat{Y}(\xi) = \hat{Z}(\xi)\hat{F}(\xi)$.

3. Let $Y$ be the truncation of the inverse Fourier transform of $\hat{Y}$ to
$[\tilde{\mu} - C\tilde{\sigma}\sqrt{\log(1/\epsilon)},\ \tilde{\mu} + C\tilde{\sigma}\sqrt{\log(1/\epsilon)}]$, for $C$ a sufficiently large constant.

Both to aid in the performance of this computation and the theoretical analysis, we note another
way to obtain the same answer. As $Y$ is the truncation of the inverse Fourier transform of a
pointwise product of $\hat{Z}$ and $\hat{F}$, we may instead write it as the truncation to the same interval
$[\tilde{\mu} - C\tilde{\sigma}\sqrt{\log(1/\epsilon)},\ \tilde{\mu} + C\tilde{\sigma}\sqrt{\log(1/\epsilon)}]$ of the convolution of $Z$ and $F : \mathbb{Z} \to \mathbb{R}$, the inverse Fourier
transform of $\hat{F}$. We show below (Claim 2.13) that

\[ F(x) = e^{-x^2/(2\tilde{\sigma}^2)}\, 2C\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}\; \mathrm{Sinc}\bigl(2\pi C\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}\, x\bigr) , \]

where $\mathrm{Sinc}(x) \stackrel{\mathrm{def}}{=} (\sin x)/x$. Also note that $F$ can be computed explicitly to within absolute error $\delta$
in time $\mathrm{poly}(\log(1/\delta))$, and thus this convolution can be computed efficiently, yielding an alternative
algorithm for computing $Y$.

Algorithm Learn-2-SIIRV-Optimal-Sample

Input: sample access to a 2-SIIRV $X$ and $\epsilon > 0$

Output: A hypothesis pseudo-distribution $Y$ that is $\epsilon$-close to $X$

1. Draw $O(1)$ samples from $P$ and with confidence probability $19/20$ compute: (a) $\tilde{\sigma}^2$, a factor
2 approximation to $\mathrm{Var}_{X \sim P}[X] + 1$, and (b) $\tilde{\mu}$, an approximation to $\mathbb{E}_{X \sim P}[X]$ to within one
standard deviation.

2. If $\tilde{\sigma} > \Omega(1/\epsilon)$, draw $O(1/\epsilon^2)$ samples and use them to estimate the mean and variance of
$X$. Output a discrete Gaussian with this mean and variance.

3. Take $N = O(\sqrt{\log(1/\epsilon)}/\epsilon^2)$ samples from $P$ to get an empirical distribution $Z$.

4. If $\tilde{\sigma} < O(\sqrt{\ln(1/\epsilon)})$, then output $Z$. Otherwise, proceed to the next step.

5. If $M$, the difference between the largest and smallest sample, is $\Omega(\tilde{\sigma}\sqrt{\log(1/\epsilon)})$, output fail.

6. Compute $F(x)$ to within $O(\epsilon^4)$ for integers $|x| \le M + C\tilde{\sigma}\sqrt{\log(1/\epsilon)}$.

7. Compute the convolution $Y$ of $Z$ and $F$ using the FFT (modulo an integer $> 2M + 2C\tilde{\sigma}\sqrt{\log(1/\epsilon)}$).

We start by analyzing the running time of the algorithm. First note that the first two steps
run in sample-linear time, i.e., $O(1/\epsilon^2)$. We now focus on the running time of the remaining steps.
Note that computing the empirical distribution $Z$ takes time $O(N)$. Computing the values of
$F(x)$ in Step 6 up to an additive error $\mathrm{poly}(\epsilon)$ can be done in time $M\,\mathrm{polylog}(1/\epsilon)$, where $M =
O(\tilde{\sigma}\sqrt{\log(1/\epsilon)}) = \tilde{O}(1/\epsilon)$. Computing the convolution is done using the FFT modulo a power
of two that is $\Theta(M)$, and so can be done in time $O(M \log M)$. So, the overall running time is
$O(N + M \log M\, \mathrm{poly}(\log(1/\epsilon))) = O(N)$.
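The core of the algorithm (smoothing the empirical distribution with the kernel $F$ and truncating around the estimated mean) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a direct convolution instead of the FFT, estimates $\tilde{\mu}$ and $\tilde{\sigma}$ from the same samples, omits the special cases of Steps 2, 4 and 5, and the names `sinc`, `kernel`, `learn_2siirv` and the default `C = 2.0` are hypothetical choices. The kernel is the closed form stated for $F(x)$ in the text.

```python
import math
import random
from collections import Counter


def sinc(x):
    """Sinc(x) = sin(x)/x, with the removable singularity filled in at 0."""
    return 1.0 if x == 0 else math.sin(x) / x


def kernel(x, sigma, eps, C):
    """F(x): a Gaussian window times a Sinc, following the stated closed form."""
    a = C * math.sqrt(math.log(1.0 / eps)) / sigma
    return math.exp(-x * x / (2.0 * sigma * sigma)) * 2.0 * a * sinc(2.0 * math.pi * a * x)


def learn_2siirv(samples, eps, C=2.0):
    """Return the pseudo-distribution Y as a dict {point: value}.

    Assumes the estimated sigma is large enough that the frequency window
    half-width C*sqrt(log(1/eps))/sigma is below 1/2.
    """
    n = len(samples)
    mu = sum(samples) / n
    var = sum((s - mu) ** 2 for s in samples) / n
    sigma = math.sqrt(var + 1.0)              # estimate of sqrt(Var[X] + 1)
    counts = Counter(samples)                 # empirical distribution Z, unnormalized
    half = C * sigma * math.sqrt(math.log(1.0 / eps))
    lo, hi = math.ceil(mu - half), math.floor(mu + half)
    y = {}
    for p in range(lo, hi + 1):               # truncation to [mu - half, mu + half]
        y[p] = sum(c * kernel(p - q, sigma, eps, C) for q, c in counts.items()) / n
    return y
```

Since $F$ takes negative values, the output is a pseudo-distribution: individual entries may be slightly negative even though the total mass is close to 1.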

We now proceed to show correctness. In the proof of Theorem 2.1 we argued that $O(1)$ samples
suffice to get that with high probability $\tilde{\sigma}$ and $\tilde{\mu}$ satisfy the desired bounds. We condition on this
event. We claim that when $\tilde{\sigma}$ is small, namely $O(\sqrt{\ln(1/\epsilon)})$, the empirical distribution suffices. This
follows from the fact that the empirical estimate of a discrete distribution $P$ has expected variation
distance $\le \epsilon$ from $P$ after $O(\|P\|_{1/2}/\epsilon^2)$ samples. By an application of Bernstein's inequality (see
Lemma C.3) it follows that a 2-SIIRV with standard deviation $\sigma$ has $1/2$-norm bounded from above
by $O(\sigma + 1)$. This proves our claim.

We also note that if the standard deviation of $X$ is $\Omega(1/\epsilon)$, then $X$ is $\epsilon$-close to a discretized
Gaussian with the same mean and variance. Indeed, for any 2-SIIRV $X$ with mean $\mu$ and standard
deviation $\sigma$, we have $d_{TV}(X, G) \le O(1/\sigma)$, where $G \sim Z(\mu, \sigma^2)$. (See, e.g., Theorem 7.1 of [CGS11].)
In this case, we claim that Step 2 of the algorithm outputs an $\epsilon$-accurate hypothesis. Indeed, by
Lemma 6 of [DDS15] it follows that with $O(1/\epsilon^2)$ samples from a discrete distribution, we can obtain
(in sample-linear time) estimates $\tilde{\mu}$ and $\tilde{\sigma}$ such that with high constant probability $|\mu - \tilde{\mu}| \le \epsilon\sigma$
and $|\sigma^2 - \tilde{\sigma}^2| \le \epsilon\sigma^2\sqrt{4 + 1/\sigma^2}$. Proposition A.4 completes the proof of our claim.

So, we henceforth assume that $\tilde{\sigma}$ is $\Omega(\sqrt{\ln(1/\epsilon)})$ and $O(1/\epsilon)$. We now proceed with the main
part of the analysis. We start with the following simple claim:

Claim 2.13. We have that

\[ F(x) = e^{-x^2/(2\tilde{\sigma}^2)}\, 2C\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}\; \mathrm{Sinc}\bigl(2\pi C\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}\, x\bigr) , \]

for all $x \in \mathbb{Z}$. Also, $|F(x)| = O(\tilde{\sigma}^{-1})\sqrt{\log(1/\epsilon)}\, \exp(-\Omega((x/\tilde{\sigma})^2))$.

Proof. As $\hat{F}$ is a convolution of functions, $F(x)$ is the pointwise product of $G(x)$, the inverse Fourier
transform of $G(\xi)$, with $S(x)$, the inverse Fourier transform of $I(\xi)$. We define $G(x) := e^{-x^2/(2\tilde{\sigma}^2)}$
and $S(x) := 2C\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}\, \mathrm{Sinc}(2\pi C\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}\, x)$. Standard results for the Fourier transform
of the Gaussian and $\mathrm{Sinc}(x)$ give us the result. Since $|\mathrm{Sinc}(x)| \le 1$ for all $x$, we have that $|F(x)| =
O(\tilde{\sigma}^{-1})\sqrt{\log(1/\epsilon)}\, \exp(-\Omega((x/\tilde{\sigma})^2))$. □
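As a quick illustration of Claim 2.13, the sketch below evaluates the stated closed form and its Gaussian envelope. The names `sinc`, `kernel`, `envelope` and the numerical parameters are assumptions made for the example. One property worth checking is that the kernel values over the integers sum to (essentially) 1 whenever the frequency window $[-a, a]$ with $a = C\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}$ fits well inside $(-1/2, 1/2)$; this follows from Poisson summation, since the sum equals the value at frequency 0 of a window smoothed by a narrow Gaussian.

```python
import math


def sinc(x):
    """Sinc(x) = sin(x)/x, with Sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(x) / x


def kernel(x, sigma, eps, C):
    """Closed form for F(x): Gaussian factor times a scaled Sinc."""
    a = C * math.sqrt(math.log(1.0 / eps)) / sigma
    return math.exp(-x * x / (2.0 * sigma * sigma)) * 2.0 * a * sinc(2.0 * math.pi * a * x)


def envelope(x, sigma, eps, C):
    """Upper bound on |F(x)| obtained from |Sinc| <= 1."""
    a = C * math.sqrt(math.log(1.0 / eps)) / sigma
    return 2.0 * a * math.exp(-x * x / (2.0 * sigma * sigma))
```

The envelope makes the second part of the claim concrete: $|F(x)|$ is at most $2a$ times a Gaussian of width $\tilde{\sigma}$, so only $O(\tilde{\sigma}\sqrt{\log(1/\epsilon)})$ integer points carry non-negligible weight.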

In order to show the correctness of our algorithm, we will need to introduce a new distribution,
$Y'$. We let $Y'$ be the truncated inverse Fourier transform of the pointwise product of $\hat{F}$ with $\hat{X}$
(note that $Y$ differs from $Y'$ by using $\hat{Z}$ instead of $\hat{X}$). We begin by showing that $d_{TV}(X, Y')$ is
small. To do this, we let $\hat{Y}' = \hat{X}\hat{F}$.

Claim 2.14. We have that $d_{TV}(X, Y') = O(\epsilon^2 \log(1/\epsilon))$.

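Proof. Note that

\[ \hat{X}(\xi) - \hat{Y}'(\xi) = \hat{X}(\xi)\,\bigl(1 - \hat{F}(\xi)\bigr) . \]

If $[\xi] \le (C-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}$, by Claim 2.12 (ii) we have $1 - \hat{F}(\xi) \le \epsilon^2$. Since $|\hat{X}(\xi)| = |\mathbb{E}[e(X\xi)]| \le
1$, in this case we have $|\hat{X}(\xi) - \hat{Y}'(\xi)| \le \epsilon^2$. Otherwise, if $[\xi] > (C-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}$, by part (i) of
the earlier lemma bounding $|\hat{X}|$ it follows that $|\hat{X}(\xi)| \le \exp(-\Omega(\tilde{\sigma}^2[\xi]^2))$. Since $0 \le 1 - \hat{F}(\xi) \le 1$, in this case we have
$|\hat{X}(\xi) - \hat{Y}'(\xi)| \le |\hat{X}(\xi)|\,|1 - \hat{F}(\xi)| \le \exp(-\Omega(\tilde{\sigma}^2[\xi]^2))$. Therefore,

\begin{align*}
\|\hat{X} - \hat{Y}'\|_1 &= \int_{-1/2}^{1/2} |\hat{X}(\xi) - \hat{Y}'(\xi)|\, d\xi \\
&= \int_{-1/2}^{-(C-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}} |\hat{X}(\xi) - \hat{Y}'(\xi)|\, d\xi
 + \int_{-(C-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}}^{(C-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}} |\hat{X}(\xi) - \hat{Y}'(\xi)|\, d\xi
 + \int_{(C-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}}^{1/2} |\hat{X}(\xi) - \hat{Y}'(\xi)|\, d\xi \\
&\le \epsilon^2 \cdot 2(C-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)} + 2\int_{(C-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}}^{1/2} \exp(-\Omega(\tilde{\sigma}^2[\xi]^2))\, d\xi \\
&= O(\epsilon^2\sqrt{\log(1/\epsilon)}/\tilde{\sigma}) + \frac{\sqrt{2\pi}}{\tilde{\sigma}} \cdot \Pr_{W \sim N(0,\tilde{\sigma}^{-2})}\Bigl[\,|W| \ge (C-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}\,\Bigr] \\
&\le O(\epsilon^2\sqrt{\log(1/\epsilon)}/\tilde{\sigma}) .
\end{align*}

Taking an inverse Fourier transform implies that $\|X - Y'\|_\infty = O(\epsilon^2\sqrt{\log(1/\epsilon)}/\tilde{\sigma})$ within the domain of
truncation. Since this domain has size $O(\tilde{\sigma}\sqrt{\log(1/\epsilon)})$, we have that the $L_1$ error between $X$ and
$Y'$ within this domain is $O(\epsilon^2\log(1/\epsilon))$. However, both $X$ and $Y'$ have at most $O(\epsilon^2)$ mass outside
of this domain, and therefore we have that $d_{TV}(X,Y') = O(\epsilon^2\log(1/\epsilon))$. □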


It remains to bound from above $d_{TV}(Y, Y')$. In particular, we will show that $d_{TV}(Y,Y')$ has
expectation $O(\epsilon)$. Then, by decreasing $\epsilon$ by a constant factor and applying the Markov and triangle
inequalities, we will have that $d_{TV}(X,Y) \le \epsilon$ with probability at least $2/3$.

Proposition 2.15. We have that $\mathbb{E}[d_{TV}(Y,Y')] \le O(\epsilon)$.

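Proof. Recall that $Y$ is the convolution of $Z$ with $F$. If we consider our samples to be random
variables $X_{(1)}, \ldots, X_{(N)}$, each of which is an i.i.d. copy of $X$, we can express $Y(p)$ for a given $p$ as
a random variable: $Y(p) = \frac{1}{N}\sum_{i=1}^{N} F(p - X_{(i)})$, for $a \le p \le b$, where $a = \tilde{\mu} - C\tilde{\sigma}\sqrt{\log(1/\epsilon)}$ and
$b = \tilde{\mu} + C\tilde{\sigma}\sqrt{\log(1/\epsilon)}$. Note that the expectation of $Y(p)$ is

\[ \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_X[F(p - X)] = \mathbb{E}_X[F(p - X)] = Y'(p) . \]

Therefore, we have that $\mathbb{E}[|Y(p) - Y'(p)|] = O(\sqrt{\mathrm{Var}(Y(p))})$. For the variance we have the following
sequence of (in)equalities:

\begin{align*}
\mathrm{Var}[Y(p)] &= \mathrm{Var}[F(p - X)]/N = \mathbb{E}\Bigl[\Bigl(F(p-X) - \sum_{q=a}^{b} F(p-q)X(q)\Bigr)^2\Bigr]/N \\
&= \sum_{r=a}^{b} (X(r)/N)\cdot\Bigl(F^2(p-r) + \sum_{q=a}^{b}\bigl(F^2(p-q)X(q)^2 - 2F(p-r)F(p-q)X(q)\bigr) + 2\sum_{q' \ne q} F(p-q)F(p-q')X(q)X(q')\Bigr) \\
&= \frac{1}{N}\sum_{q=a}^{b} F^2(p-q)\bigl(X(q) - X(q)^2\bigr) \le \frac{1}{N}\sum_{q=a}^{b} F^2(p-q)X(q) .
\end{align*}

We claim that this quantity will become small as $p$ moves away from $\tilde{\mu}$. Intuitively, this should be
the case because for $p$ far from $\tilde{\mu}$, then for all $q$ either $|p - q|$ will be large or $|q - \tilde{\mu}|$ will be large.
In the former case, $F(p - q)$ is small, and in the latter $X(q)$ is. In order to properly analyze this
quantity, we will have to group up these errors for $p$ in blocks of size $\tilde{\sigma}$. In particular, we have that

\begin{align*}
\sum_{p=\tilde{\mu}+t\tilde{\sigma}}^{\tilde{\mu}+(t+1)\tilde{\sigma}} \mathbb{E}[|Y(p) - Y'(p)|]
&= \sum_{p=\tilde{\mu}+t\tilde{\sigma}}^{\tilde{\mu}+(t+1)\tilde{\sigma}} O\bigl(\sqrt{\mathrm{Var}(Y(p))}\bigr) \\
&= O(1/\sqrt{N}) \sum_{p=\tilde{\mu}+t\tilde{\sigma}}^{\tilde{\mu}+(t+1)\tilde{\sigma}} \sqrt{\sum_{q=a}^{b} F^2(p-q)X(q)} \\
&= O(\sqrt{\tilde{\sigma}/N}) \sqrt{\sum_{p=\tilde{\mu}+t\tilde{\sigma}}^{\tilde{\mu}+(t+1)\tilde{\sigma}} \sum_{q=a}^{b} F^2(p-q)X(q)} && \text{(by Cauchy-Schwarz)} \\
&\le O(\sqrt{\tilde{\sigma}/N}) \sqrt{\sum_{r} F^2(r) \sum_{q=\tilde{\mu}+t\tilde{\sigma}-r}^{\tilde{\mu}+(t+1)\tilde{\sigma}-r} X(q)} \\
&\le O(\sqrt{\tilde{\sigma}/N}) \sqrt{\sum_{r} F^2(r) \exp\bigl(-\Omega((|t\tilde{\sigma} - r|/\tilde{\sigma})^2)\bigr)} && \text{(by Bernstein's inequality)} \\
&= O(\sqrt{\tilde{\sigma}/N}) \sqrt{\sum_{r} S^2(r) \exp\bigl(-\Omega(((t\tilde{\sigma} - r)/\tilde{\sigma})^2 + (r/\tilde{\sigma})^2)\bigr)} \\
&= O(\sqrt{\tilde{\sigma}/N}) \sqrt{\sum_{r} S^2(r) \exp(-\Omega(t^2))} \\
&= O(\sqrt{\tilde{\sigma}/N}) \exp(-\Omega(t^2)) \sqrt{\int I^2(\xi)\, d\xi} && \text{(by Plancherel's Theorem)} \\
&= O(\sqrt{\tilde{\sigma}/N}) \exp(-\Omega(t^2))\, \tilde{\sigma}^{-1/2}\log^{1/4}(1/\epsilon) = O\bigl(\log^{1/4}(1/\epsilon)/\sqrt{N}\bigr) \exp(-\Omega(t^2)) .
\end{align*}

Summing over $t$ gives that

\[ \mathbb{E}[d_{TV}(Y,Y')] = O\bigl(\log^{1/4}(1/\epsilon)/\sqrt{N}\bigr) = O(\epsilon) , \]

for $N = \sqrt{\log(1/\epsilon)}/\epsilon^2$. This completes the proof. □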


2.4.2 Sample Optimal Learning Algorithm for $k$-SIIRVs

Theorem 2.16. For $\epsilon \le 1/\mathrm{poly}(k)$, there exists an algorithm that, given $O(k\sqrt{\log(1/\epsilon)}/\epsilon^2)$
independent samples from a $k$-SIIRV $X$, with probability at least $2/3$ outputs a hypothesis distribution
$Y$ that is within $\epsilon$ of $X$ in total variation distance.

The proof of this theorem is somewhat analogous to that of Theorem 2.11. However, it should
be noted that the runtime of this algorithm is not given. This is because the runtime of the
simplest such algorithm is actually exponential in $k$. The difficulty is that while in the 2-SIIRV case
we could determine the effective support of the Fourier transform just from the standard deviation,
in the case of $k$-SIIRVs this is not the case. In essence, our algorithm will first guess this effective
support (of which there are exponentially many possibilities), and then given this guess will run 
an appropriate algorithm. At the end, we will need to run a standard tournament procedure 
(e.g., mm) to determine which of these guesses lead to the closest approximation to X. Since 
the number of possibilities is $2^{O(k)}$ (see Claim 2.19), the sample complexity of this tournament is
$O(k/\epsilon^2)$.

As in Algorithm Learn-SIIRV, we begin by estimating the mean and variance with $O(1)$ samples,
producing estimates $\tilde{\mu}$ and $\tilde{\sigma}^2$ with $(\mathrm{Var}[X] + 1)/2 \le \tilde{\sigma}^2 \le 2(\mathrm{Var}[X] + 1)$ and $\mathbb{E}[X] - \tilde{\sigma} \le
\tilde{\mu} \le \mathbb{E}[X] + \tilde{\sigma}$. Again, if $\tilde{\sigma} = O(k\sqrt{\log(1/\epsilon)})$, we output the empirical distribution after taking
$O(k\sqrt{\log(1/\epsilon)}/\epsilon^2)$ samples. Our upper bound on the $1/2$-norm of $k$-SIIRVs (Lemma C.3) implies
that this step gives an $\epsilon$-accurate hypothesis. This allows us to assume that $\tilde{\sigma} = \Omega(k\sqrt{\log(1/\epsilon)})$.
We will assume this throughout the remainder of our analysis.

Once again, under these assumptions, we can use Bernstein's inequality to prove concentration
bounds for $X$:

Lemma 2.17. Suppose that $\tilde{\sigma} \ge Ck\sqrt{\log(1/\epsilon)}$ for $C$ sufficiently large. Then for all $t > 0$ we have
that

\[ \Pr\bigl(|X - \tilde{\mu}| \ge (2+t)\tilde{\sigma}\bigr) \le \exp(-\Omega(t^2)) + \epsilon^2 . \]

Proof. We assumed that $|\mu - \tilde{\mu}| \le \tilde{\sigma}$. So, if $|X - \tilde{\mu}| \ge (2+t)\tilde{\sigma}$, then $|X - \mu| \ge (1+t)\tilde{\sigma}$. Bernstein's
inequality gives that

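\[ \Pr\bigl(X - \mathbb{E}[X] \ge (1+t)\tilde{\sigma}\bigr) \le \exp\left( \frac{-\frac{1}{2}(1+t)^2\tilde{\sigma}^2}{\mathrm{Var}[X] + \frac{1}{3}k} \right) . \]

Since $\tilde{\sigma} = \Omega(k\sqrt{\log(1/\epsilon)}) = \Omega(k)$ and $\mathrm{Var}[X] = O(\tilde{\sigma}^2)$, we have that $\mathrm{Var}[X] + \frac{1}{3}k = O(\tilde{\sigma}^2)$. □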

In particular, this implies that with probability $1 - 2\epsilon^2$ we have $|X - \tilde{\mu}| = O(\tilde{\sigma}\sqrt{\log(1/\epsilon)})$. Next,
we will recall concentration bounds on the Fourier transform of $X$. To do so, we first devise some
notation. Let $X = \sum_{i=1}^{n} X_i$, where the $X_i$ are independent $k$-IRVs. We let $p_{i,j}$ be the probability that
two independent copies of $X_i$ have absolute difference $j$. We let $v_j = \sum_{i=1}^{n} p_{i,j}$. In terms of this,
we restate two equations from the proof of Lemma 2.3 as

\[ |\hat{X}(\xi)| = \exp\Bigl(-\Omega\Bigl(\sum_{j=1}^{k-1} v_j [j\xi]^2\Bigr)\Bigr) \]

and

\[ \mathrm{Var}(X) = \Theta\Bigl(\sum_{j=1}^{k-1} j^2 v_j\Bigr) . \]

Finally, we note that we can find some particular good scale to consider. In particular, we note

Lemma 2.18. There exists an $m \in [k]$ so that $\sum_{j=m}^{2m} v_j = \Omega(\tilde{\sigma}/k)^2$.

Proof. We assume for sake of contradiction that this is not the case. We have that

\[ \sum_{j=m}^{2m} v_j < c(\tilde{\sigma}/k)^2 , \]

where $c$ is a sufficiently small constant. This implies that

\[ \sum_{j=m}^{2m} j^2 v_j < c(2\tilde{\sigma}m/k)^2 . \]

Summing over $m$ powers of 2 less than or equal to $k$, we find that

\[ \sum_{j=1}^{k} j^2 v_j < c(2\tilde{\sigma}/k)^2 \sum_{i=0}^{\lfloor \log_2(k) \rfloor} 4^i < c\,4^{\log_2(k)+1}(4\tilde{\sigma}^2/k^2) = 16c\tilde{\sigma}^2 . \]

However, we know that

\[ \sum_{j=1}^{k} j^2 v_j = \Theta(\mathrm{Var}(X)) = \Theta(\tilde{\sigma}^2) . \]

This yields a contradiction for $c$ sufficiently small. □

Our algorithm will begin by guessing a value for $m$. We assume throughout the following that
$m$ represents such an integer. Furthermore, we assume that our algorithm has guessed values
$w_m, w_{m+1}, \ldots, w_{2m}$ so that $w_j \le v_j$ for all $j$ and $\sum_{j=m}^{2m} w_j = \Omega(\tilde{\sigma}/k)^2$.

Claim 2.19. Given $m$ and $\tilde{\sigma}$, this can be done by considering only $2^{O(k)}$ possible vectors of $w_j$'s.

Proof. By Lemma 2.18, we have that $\sum_{j=m}^{2m} v_j = \Omega(\tilde{\sigma}/k)^2$. Suppose concretely that $C'$ is a constant
such that we always have $\sum_{j=m}^{2m} v_j \ge C'(\tilde{\sigma}/k)^2$. Then, we claim that there is some set of non-negative
integers $a_j$ for $m \le j \le 2m$ such that $\sum_{j=m}^{2m} a_j = m$ and $v_j \ge (a_j/(m+1)) \cdot C'(\tilde{\sigma}/k)^2/2$.
In particular, take $a_j = \lfloor v_j \cdot 2(m+1)/(C'(\tilde{\sigma}/k)^2) \rfloor$; then $|a_j\, C'(\tilde{\sigma}/k)^2/(2(m+1)) - v_j| \le C'(\tilde{\sigma}/k)^2/(2(m+1))$, and so

\[ \sum_{j=m}^{2m} a_j \ge \frac{\sum_{j=m}^{2m} v_j}{C'(\tilde{\sigma}/k)^2/(2(m+1))} - (m+1) \ge m+1 . \]

Decreasing some of the $a_j$ if necessary, we may take their sum to be exactly $m$.

If we guess such integers, then we can set $w_j = (a_j/(m+1)) \cdot C'(\tilde{\sigma}/k)^2/2$ and have $v_j \ge w_j$ and
$\sum_{j=m}^{2m} w_j = \frac{m}{m+1} \cdot C'(\tilde{\sigma}/k)^2/2 = \Omega(\tilde{\sigma}/k)^2$. There are $\binom{n+k-1}{k-1}$ $k$-vectors $a$ of non-negative integers summing
to $n$. So in our case, there are $\binom{2m}{m} \le 2^{2m} \le 2^{2k}$ possible combinations of $w_j$. □
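The counting and rounding steps in this proof are easy to check by brute force for small $m$. The sketch below is illustrative only (the helper names `count_guess_vectors` and `round_guess`, and the argument `lower` standing in for the lower bound $C'(\tilde{\sigma}/k)^2$, are assumptions): it enumerates the non-negative integer vectors $(a_m, \ldots, a_{2m})$ with sum $m$, and rounds a given vector $v$ down to multiples of $\delta = \texttt{lower}/(2(m+1))$.

```python
import math
from fractions import Fraction
from itertools import product


def count_guess_vectors(m):
    """Count non-negative integer vectors (a_m, ..., a_{2m}) with sum m."""
    return sum(1 for a in product(range(m + 1), repeat=m + 1) if sum(a) == m)


def round_guess(v, lower):
    """Round each v_j down to a multiple of delta = lower / (2 (m + 1)).

    `v` lists v_m, ..., v_{2m} (so it has m + 1 entries) and `lower` is an
    assumed lower bound on their sum. Returns the integers a_j and the
    rounded values w_j = a_j * delta, which never exceed v_j.
    """
    m = len(v) - 1
    delta = Fraction(lower) / (2 * (m + 1))
    a = [math.floor(Fraction(x) / delta) for x in v]
    w = [aj * delta for aj in a]
    return a, w
```

Because each $a_j$ loses less than one unit of $\delta$, the integers returned by `round_guess` sum to at least $m+1$ whenever the entries of `v` sum to at least `lower`, matching the displayed inequality.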

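We then have that

\[ |\hat{X}(\xi)| \le \exp\Bigl(-\Omega\Bigl(\sum_{j=m}^{2m} w_j [j\xi]^2\Bigr)\Bigr) =: B(\xi) . \]

We have the following simple lemma about $B$:

Lemma 2.20. If $|\xi - \xi'| \le 1/(6m)$, then

\[ B(\xi)B(\xi') = \exp\bigl(-\Omega(\tilde{\sigma}^2(\xi - \xi')^2 m^2/k^2)\bigr) . \]

Proof. Firstly, we show that for each $m \le j \le 2m$, either we have $[j\xi] \ge j|\xi - \xi'|/2$ or $[j\xi'] \ge
j|\xi - \xi'|/2$. If $[j\xi] < j|\xi - \xi'|/2$, then there is an integer $i$ such that $|j\xi - i| < |j\xi - j\xi'|/2$ and so
$|j\xi' - i| \ge |j\xi - j\xi'| - |j\xi - i| > |j\xi - j\xi'|/2$. But we also have $|j\xi' - i| \le |j\xi - j\xi'| + |j\xi - i| \le 3|j\xi -
j\xi'|/2 \le 3j/(12m) \le \frac{1}{2}$. So, $i$ is still one of the closest integers to $j\xi'$ and $[j\xi'] = |j\xi' - i| \ge |j\xi - j\xi'|/2$.
Thus, we have:

\begin{align*}
B(\xi)B(\xi') &= \exp\Bigl(-\Omega\Bigl(\sum_{j=m}^{2m} w_j\bigl([j\xi]^2 + [j\xi']^2\bigr)\Bigr)\Bigr) \\
&\le \exp\Bigl(-\Omega\Bigl(\sum_{j=m}^{2m} w_j\, j^2 (\xi - \xi')^2\Bigr)\Bigr) \\
&\le \exp\Bigl(-\Omega\Bigl(\Bigl(\sum_{j=m}^{2m} w_j\Bigr) m^2 (\xi - \xi')^2\Bigr)\Bigr) \\
&\le \exp\bigl(-\Omega(\tilde{\sigma}^2(\xi - \xi')^2)\, m^2/k^2\bigr) ,
\end{align*}

where the final line follows since we guessed $w$ so that $\sum_{j=m}^{2m} w_j = \Omega(\tilde{\sigma}/k)^2$. □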
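The first step of this proof, that at least one of $[j\xi]$ and $[j\xi']$ is at least $j|\xi - \xi'|/2$ whenever $|\xi - \xi'| \le 1/(6m)$ and $m \le j \le 2m$, can be spot-checked numerically. The sketch below is only an illustration; `dist_to_int` and `lemma_step_holds` are hypothetical names, and a small tolerance absorbs floating-point rounding.

```python
import random


def dist_to_int(x):
    """[x]: the distance from x to the nearest integer."""
    return abs(x - round(x))


def lemma_step_holds(xi, xi2, m, tol=1e-9):
    """Check max([j xi], [j xi2]) >= j |xi - xi2| / 2 for all m <= j <= 2m."""
    gap = abs(xi - xi2)
    return all(
        max(dist_to_int(j * xi), dist_to_int(j * xi2)) >= j * gap / 2 - tol
        for j in range(m, 2 * m + 1)
    )
```

The hypothesis $|\xi - \xi'| \le 1/(6m)$ matters: it guarantees $j|\xi - \xi'| \le 1/3$, so the separation between $j\xi$ and $j\xi'$ cannot be hidden by wrapping around an integer.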

This implies that within each interval of length $1/(6m)$, $B(\xi)$ is bounded by an appropriate
Gaussian. In particular, for $0 \le i < 6m$, let $I_i$ be the interval $[i/6m, (i+1)/6m]$, and let $\xi_i$ be the
element of $I_i$ at which $B$ is maximized. Since $\sum_{j=m}^{2m} w_j [j\xi]^2$ is a piecewise quadratic, we can easily
calculate its minima $\xi_i$ on each $I_i$ given $w_j$ for $m \le j \le 2m$. As a corollary of the above, we have:

Corollary 2.21. For $\xi \in I_i$, we have that

\[ |\hat{X}(\xi)| \le \exp\bigl(-\Omega(\tilde{\sigma}^2(\xi - \xi_i)^2)\, m^2/k^2\bigr) . \]

Proof. We can write $|\hat{X}(\xi)| \le B(\xi) \le \sqrt{B(\xi)B(\xi_i)} = \exp(-\Omega(\tilde{\sigma}^2(\xi - \xi_i)^2)\, m^2/k^2)$. □


From this point onwards, our analysis is nearly identical to that from the previous subsection.
We will need the function $I(\xi)$ to equal 1 not just near 0, but also near all of the $\xi_i$'s so that $\hat{F}$
will be close to 1 on the effective support of $\hat{X}$. This has the effect of making its inverse Fourier
transform a sum of $O(m)$ Sinc functions rather than a single one. This in turn will increase the
size of $F$ by a factor of $m$, which is where the final additional factor of $k$ in our sample size
comes from.

Our algorithm depends on taking the empirical Fourier transform of $X$ and truncating it in
a judiciously chosen way. Let $G(\xi)$ be a Gaussian of standard deviation $1/\tilde{\sigma}$ taken modulo 1. In
particular,

\[ G(\xi) = \sum_{n \in \mathbb{Z}} \frac{1}{\sqrt{2\pi}/\tilde{\sigma}}\, e^{-\tilde{\sigma}^2 (n+\xi)^2/2} . \]

Let $I(\xi)$ be the indicator function that is 1 if and only if $\xi$ is within $Ck\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}/m$ of one
of the $\xi_i$ modulo 1, for $C$ a sufficiently large constant. Let $\hat{F}$ be the convolution of $I$ and $G$. As
before, $\hat{F}$ approximates $I$ in that:


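Claim 2.22. (i) $\hat{F}(\xi) \in [0,1]$ for all $\xi$.

(ii) $\hat{F}(\xi) \ge 1 - \epsilon^2/k$ for $\xi$ within $(Ck/m - 3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}$ of some $\xi_i$.

(iii) $\hat{F}(\xi) \le \epsilon^2/k$ for $\xi$ not within $(Ck/m + 3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}$ of any $\xi_i$.

Proof. Note that $\hat{F}$ is the convolution of $I$ and $G$. $I(x)$ is the indicator function of some set $T$.
Explicitly, we have:

\[ \hat{F}(\xi) = \int_{T} G(\xi - \nu)\, d\nu \le \int_{-1/2}^{1/2} G(\nu)\, d\nu = \int_{-\infty}^{\infty} \frac{\tilde{\sigma}}{\sqrt{2\pi}}\, e^{-\tilde{\sigma}^2 \nu^2/2}\, d\nu = 1 . \]

This gives (i).

For (ii), we note that since $T$ contains the interval $[\xi_i - Ck\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}/m,\ \xi_i + Ck\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}/m]$,
we have

\[ \hat{F}(\xi) \ge \int_{\xi - \xi_i - Ck\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}/m}^{\xi - \xi_i + Ck\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}/m} \frac{\tilde{\sigma}}{\sqrt{2\pi}}\, e^{-\tilde{\sigma}^2 \nu^2/2}\, d\nu . \]

Since $|\xi - \xi_i| \le (Ck/m - 3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}$, this interval contains $[-3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)},\ 3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}]$,
and so

\[ \hat{F}(\xi) \ge \int_{-3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}}^{3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}} \frac{\tilde{\sigma}}{\sqrt{2\pi}}\, e^{-\tilde{\sigma}^2 \nu^2/2}\, d\nu \ge 1 - O(\epsilon^3) \ge 1 - \epsilon^2/k , \]

by standard bounds on the Gaussian.

For (iii), we note that $T$ is disjoint from the set $[\xi - 3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)},\ \xi + 3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}]$. We
have

\[ \hat{F}(\xi) = \int_{\xi + \nu \ (\mathrm{mod}\ \mathbb{Z}) \in T} \frac{\tilde{\sigma}}{\sqrt{2\pi}}\, e^{-\tilde{\sigma}^2 \nu^2/2}\, d\nu \le \int_{|\nu| \ge 3\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}} \frac{\tilde{\sigma}}{\sqrt{2\pi}}\, e^{-\tilde{\sigma}^2 \nu^2/2}\, d\nu = O(\epsilon^3) \le \epsilon^2/k . \]

□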


Our algorithm is now quite simple to state and works as follows:

1. Let $Z$ be the empirical distribution and $\hat{Z}$ be the Fourier transform of $Z$.

2. Let $\hat{Y}$ be the pointwise product of $\hat{Z}$ with $\hat{F}$.

3. Let $Y$ be the truncation of the inverse Fourier transform of $\hat{Y}$ to
$[\tilde{\mu} - C\tilde{\sigma}\sqrt{\log(1/\epsilon)},\ \tilde{\mu} + C\tilde{\sigma}\sqrt{\log(1/\epsilon)}]$, for $C$ a sufficiently large constant.

Both to aid in the performance of this computation and in its theoretical analysis, we note another
way to obtain the same answer. As $Y$ is the truncation of the inverse Fourier transform of a
pointwise product of $\hat{Z}$ and $\hat{F}$, we may instead write it as the truncation of the convolution of $Z$
and $F$, the inverse Fourier transform of $\hat{F}$. As $\hat{F}$ is a convolution of functions, $F(x)$ is the pointwise
product of $G(x)$ (a Gaussian of standard deviation $O(\tilde{\sigma})$, normalized to have size 1 at the origin)
with $S(x)$, an explicit combination of Sinc functions. Note that $F$ can be computed explicitly, and
thus this convolution can be computed in polynomial time.

In order to analyze the correctness, we will need to introduce a new distribution, $Y'$. We let
$Y'$ be the truncated inverse Fourier transform of the pointwise product of $\hat{F}$ with $\hat{X}$ (note that $Y$
differs by using $\hat{Z}$ instead of $\hat{X}$). We begin by showing that $d_{TV}(X, Y')$ is small. To do this, we let
$\hat{Y}' = \hat{X}\hat{F}$.


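Claim 2.23. We have that

\[ \|\hat{X} - \hat{Y}'\|_1 = O(\epsilon^2\sqrt{\log(1/\epsilon)}/\tilde{\sigma}) . \]

Proof. We similarly use the fact that

\[ \hat{X}(\xi) - \hat{Y}'(\xi) = \hat{X}(\xi)\,\bigl(1 - \hat{F}(\xi)\bigr) . \]

If $[\xi - \xi_i]$ is at most $(Ck/m - 3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}$ for some $i$, the above expression has absolute value
at most $\epsilon^2/k$ because $1 - \hat{F}(\xi)$ does. Otherwise, it has absolute value $\exp(-\Omega(\tilde{\sigma}^2[\xi - \xi_{i_0}]^2 m^2/k^2))$,
where $\xi_{i_0}$ is such that $i_0 \in \arg\min_i [\xi - \xi_i]$. Next, we combine these bounds and integrate.

We consider intervals $[a_i, b_i]$ with $b_i = a_{i+1}$ for $1 \le i < 6m$ and $b_{6m} = a_1 + 1$ such that $\xi_i \in [a_i, b_i]$
and any $x + \mathbb{Z}$ in $[a_i, b_i] + \mathbb{Z}$ is at least as close to $\xi_i + \mathbb{Z}$ as to any $\xi_j + \mathbb{Z}$ for $j \ne i$.

\begin{align*}
\|\hat{X} - \hat{Y}'\|_1 &= \int_{a_1}^{b_{6m}} |\hat{X}(\xi) - \hat{Y}'(\xi)|\, d\xi \\
&= \sum_{i=1}^{6m} \Biggl( \int_{a_i}^{\max\{\xi_i - (Ck/m-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)},\, a_i\}} |\hat{X}(\xi) - \hat{Y}'(\xi)|\, d\xi \\
&\qquad + \int_{\max\{\xi_i - (Ck/m-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)},\, a_i\}}^{\min\{\xi_i + (Ck/m-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)},\, b_i\}} |\hat{X}(\xi) - \hat{Y}'(\xi)|\, d\xi \\
&\qquad + \int_{\min\{\xi_i + (Ck/m-3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)},\, b_i\}}^{b_i} |\hat{X}(\xi) - \hat{Y}'(\xi)|\, d\xi \Biggr) \\
&\le O(1) \cdot \sum_{i=1}^{6m} \Bigl( \Pr_{W \sim N(0,\tilde{\sigma}^{-2}k^2/m^2)}\bigl[\,|W| \ge (Ck/m - 3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}\,\bigr] + (\epsilon^2/k) \cdot 2(Ck/m - 3)\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)} \Bigr) \\
&\le O(\epsilon^2\sqrt{\log(1/\epsilon)}/\tilde{\sigma}) .
\end{align*}

This completes the proof. □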

Taking an inverse Fourier transform implies that $\|X - Y'\|_\infty = O(\epsilon^2\sqrt{\log(1/\epsilon)}/\tilde{\sigma})$, at least within
the domain of truncation. Since this domain has size $O(\tilde{\sigma}\sqrt{\log(1/\epsilon)})$, we have that the $L_1$ error
between $X$ and $Y'$ within this domain is $O(\epsilon^2\log(1/\epsilon))$. However, both $X$ and $Y'$ have at most
$O(\epsilon^2)$ mass outside of this domain, and therefore we have that

\[ d_{TV}(X,Y') = O(\epsilon^2\log(1/\epsilon)) . \]

It remains to bound $d_{TV}(Y,Y')$. In particular, we will show that it has expectation $O(\epsilon)$. Then,
by decreasing $\epsilon$ by a constant factor and applying Markov's and triangle inequalities, we will have
that $d_{TV}(X, Y) \le \epsilon$, with probability at least $2/3$.

Proposition 2.24. We have that $\mathbb{E}[d_{TV}(Y, Y')] \le O(\epsilon)$.

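Proof. Recall that $Y$ is the convolution of $Z$ with $F$. If we consider our samples to be random
variables $X_{(1)}, \ldots, X_{(N)}$, each of which is an i.i.d. copy of $X$, we can express $Y(p)$ for a given $p$ as
a random variable:

\[ Y(p) = \frac{1}{N}\sum_{i=1}^{N} F(p - X_{(i)}) , \]

for $a \le p \le b$, where $a = \tilde{\mu} - C\tilde{\sigma}\sqrt{\log(1/\epsilon)}$ and $b = \tilde{\mu} + C\tilde{\sigma}\sqrt{\log(1/\epsilon)}$. Note that the expectation
of $Y(p)$ is

\[ \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_X[F(p - X)] = \mathbb{E}_X[F(p - X)] = Y'(p) . \]

Therefore, we have that $\mathbb{E}[|Y(p) - Y'(p)|] = O(\sqrt{\mathrm{Var}(Y(p))})$. We bound the variance as follows:

\begin{align*}
\mathrm{Var}[Y(p)] &= \mathrm{Var}[F(p - X)]/N = \mathbb{E}\Bigl[\Bigl(F(p-X) - \sum_{q} F(p-q)X(q)\Bigr)^2\Bigr]/N \\
&= \sum_{r=a}^{b} (X(r)/N)\cdot\Bigl(F^2(p-r) + \sum_{q=a}^{b}\bigl(F^2(p-q)X(q)^2 - 2F(p-r)F(p-q)X(q)\bigr) + 2\sum_{q' \ne q} F(p-q)F(p-q')X(q)X(q')\Bigr) \\
&= \frac{1}{N}\sum_{q=a}^{b} F^2(p-q)\bigl(X(q) - X(q)^2\bigr) \le \frac{1}{N}\sum_{q=a}^{b} F^2(p-q)X(q) .
\end{align*}

We have that

\begin{align*}
\sum_{p \in [\tilde{\mu}+t\tilde{\sigma},\, \tilde{\mu}+(t+1)\tilde{\sigma}]} \mathbb{E}[|Y(p) - Y'(p)|]
&= O(\sqrt{\tilde{\sigma}/N}) \sqrt{\sum_{r} F^2(r) \sum_{q \in [\tilde{\mu}+t\tilde{\sigma}-r,\, \tilde{\mu}+(t+1)\tilde{\sigma}-r]} X(q)} && \text{(by Cauchy-Schwarz)} \\
&= O(\sqrt{\tilde{\sigma}/N}) \sqrt{\sum_{r} F^2(r) \exp\bigl(-\Omega((|t\tilde{\sigma} - r|/\tilde{\sigma})^2)\bigr)} && \text{(by Lemma 2.17)} \\
&= O(\sqrt{\tilde{\sigma}/N}) \sqrt{\sum_{r} S(r)^2 \exp\bigl(-\Omega(((t\tilde{\sigma} - r)/\tilde{\sigma})^2 + (r/\tilde{\sigma})^2)\bigr)} \\
&= O(\sqrt{\tilde{\sigma}/N}) \sqrt{\sum_{r} S(r)^2 \exp(-\Omega(t^2))} \\
&= O(\sqrt{\tilde{\sigma}/N}) \exp(-\Omega(t^2)) \sqrt{\sum_{r} S(r)^2} \\
&= O(\sqrt{\tilde{\sigma}/N}) \exp(-\Omega(t^2)) \sqrt{\int I(\xi)^2\, d\xi} && \text{(by Plancherel's Theorem)} \\
&= O(\sqrt{\tilde{\sigma}/N}) \exp(-\Omega(t^2)) \sqrt{k\tilde{\sigma}^{-1}\sqrt{\log(1/\epsilon)}} \\
&= O\bigl(k^{1/2}\log^{1/4}(1/\epsilon)/\sqrt{N}\bigr) \exp(-\Omega(t^2)) .
\end{align*}

Summing the above over $t$ gives that

\[ \mathbb{E}[d_{TV}(Y,Y')] = O\bigl(k^{1/2}\log^{1/4}(1/\epsilon)/\sqrt{N}\bigr) = O(\epsilon) , \]

for $N = k\sqrt{\log(1/\epsilon)}/\epsilon^2$. This completes the proof. □

3 Cover Size Upper Bound and Efficient Construction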

We start by establishing an upper bound on the cover size and then proceed to describe our efficient
algorithm for the construction of a proper cover with near-minimum size. To prove the desired
upper bound on the size of the cover, we proceed as follows: We start (Section 3.1) by reducing the
cover size problem to the case that the order $n$ of the $k$-SIIRV is at most $\mathrm{poly}(k/\epsilon)$. In the second
and main step (Section 3.2), we prove the desired upper bound for the polynomially sparse case.
Our efficient algorithm for the cover construction (Section 3.3) is based on dynamic programming
and follows a similar case analysis.


3.1 Reduction to Sparse Case

Our starting point is the following theorem:

Theorem 3.1. ([DDO+13], Theorem 1.2) Let $P \in \mathcal{S}_{n,k}$ be a $k$-SIIRV of order $n$. Then, for any
$\epsilon > 0$, $P$ is either

1. a distribution with variance at most $\mathrm{poly}(k/\epsilon)$; or

2. $\epsilon$-close to a distribution $P'$ such that for a random variable $X \sim P'$ we have $X = cZ + Y$
for some $1 \le c \le k - 1$, where $Y, Z$ are independent random variables such that: (i) $Y$ is
distributed as a $c$-IRV, and (ii) $Z$ is a discretized normal random variable with parameters
$\mu/c$, $\sigma^2/c^2$, where $\mu = \mathbb{E}[X]$ and $\sigma^2 = \mathrm{Var}[X]$.

The above theorem allows us to reduce the problem of constructing an $O(\epsilon)$-cover for $\mathcal{S}_{n,k}$ to
the problem of constructing an $\epsilon$-cover for $\mathcal{S}_{n',k}$, where $n' = \mathrm{poly}(k/\epsilon)$. Indeed, given an arbitrary
$k$-SIIRV $P \in \mathcal{S}_{n,k}$ we proceed as follows: If $P$ belongs to Case 1 of the above theorem, then we
show (Lemma 3.2) that there exists a translation of a $k$-SIIRV with $n' = \mathrm{poly}(k/\epsilon)$ variables that
is $\epsilon$-close to $P$. We show in the following subsection (Proposition 3.3) that $\mathcal{S}_{n',k}$ admits an $\epsilon$-cover
of size $(1/\epsilon)^{O(k\log(1/\epsilon))}$. Since there are $O(kn)$ possible translations, this gives a $2\epsilon$-cover of size
$n\,(1/\epsilon)^{O(k\log(1/\epsilon))}$ for $k$-SIIRVs in Case 1.

Moreover, it is not difficult to show that there exists an $\epsilon$-cover for distributions in Case 2 with
at most $n \cdot (k/\epsilon)^{O(k)}$ points. In particular, we claim that for distributions in sub-case 2(i) there
exists an $\epsilon$-cover of size $(1/\epsilon)^{O(k)}$, and for distributions in sub-case 2(ii) there exists an $\epsilon$-cover of
size $O(n)$. Assuming these claims, the sub-additivity of total variation distance (Proposition A.3)
implies that distributions in Case 2 have a $2\epsilon$-cover of size $n \cdot (1/\epsilon)^{O(k)}$, as desired.

Note that the random variable $Y$ in Case 2(i) is distributed as a $k$-IRV, i.e., it has support
of size $k$. It is well-known and easy to show that the set of all distributions over a domain of size $k$ has
an $\epsilon$-cover of size $(1/\epsilon)^{O(k)}$. It remains to show that we can $\epsilon$-cover the set of discretized normal
distributions of Case 2(ii) with $O(nk/\epsilon)$ points. To do this, we exploit the fact that the variance
of such distributions is large. Let $\sigma_{\min} = \Omega(k^9/\epsilon^3)$ be the minimum variance of a $k$-SIIRV $X$ in
Case 2. Note that the discrete Gaussian in Case 2 has a variance of $\mathrm{Var}[X]/c^2$. Hence, we want to
$\epsilon$-cover the set of discrete Gaussians with standard deviation $\sigma$ in the interval $[\sigma_{\min}, \sigma_{\max}]$, where
$\sigma_{\max} = O(\sqrt{n}\,k)$, and mean value $\mu$ in the interval $[0, n(k-1)]$. Consider the following discretization
of the space $(\sigma^2, \mu)$: We first define a geometric grid on $\sigma^2$ with ratio $(1+\epsilon)$, i.e., $\sigma_i^2 = \sigma_{\min}^2(1+\epsilon)^i$,
where $0 \le i \le i_{\max}$ and $i_{\max} = O((1/\epsilon) \cdot \log(n))$. For every fixed $i$, we define an additive grid
on the means, so that $|\mu_{j+1} - \mu_j| \le \epsilon \cdot \sigma_i$. A combination of Propositions A.2 and A.4 implies that
this grid defines an $\epsilon$-cover. Note that the total size of the described grid on $(\sigma^2, \mu)$ is


\[ \sum_{i=0}^{i_{\max}} \frac{n(k-1)}{\epsilon \cdot \sigma_i} = \sum_{i=0}^{i_{\max}} \frac{n(k-1)}{\epsilon \cdot \sigma_{\min}(1+\epsilon)^{i/2}} = O(n) , \]

where the last step follows from the lower bound on $\sigma_{\min}$ and the elementary inequality
$\sum_i (1+\epsilon)^{-i/2} = O(1/\epsilon)$.

The following lemma completes our reduction to the $n = \mathrm{poly}(k/\epsilon)$ case:

Lemma 3.2. Let $P \in \mathcal{S}_{n,k}$ be a $k$-SIIRV with $\mathrm{Var}_{X \sim P}[X] = V$. For any $0 < \delta < 1/4$, there exists
$Q \in \mathcal{S}_{n,k}$ with $d_{TV}(P,Q) = O(\delta V)$ such that all but $O(k + V/\delta)$ of the $k$-IRVs defining $Q$ are
constant.

The proof of Lemma 3.2 is deferred to Appendix D.1. Note that an application of the lemma
for $\delta = \epsilon/V$ completes the proof.

3.2 Cover Upper Bound for Sparse Support

In this subsection we prove the desired upper bound on the cover size for the sparse case:

Proposition 3.3. Fix arbitrary constants $c, C > 0$. Consider $n, k, \epsilon$ satisfying $\epsilon \le k^{-c}$ and $n \le
(k/\epsilon)^C$. Then there exists an $\epsilon$-cover of $\mathcal{S}_{n,k}$ under $d_{TV}$ of size $(1/\epsilon)^{O_{c,C}(k\log(1/\epsilon))}$.

Our proof proceeds by analyzing the Fourier transform of the probability density functions of
$k$-SIIRVs. We will need the following definitions.

Basic Definitions. For $\xi \in \mathbb{R}$, recall that we use the notation $e(\xi) := \exp(-2\pi i \xi)$. For a probability
distribution $P$ over $\mathbb{Z}$, its Fourier Transform is the function $\hat{P} : [0,1) \to \mathbb{C}$ defined by $\hat{P}(\xi) =
\mathbb{E}_{y \sim P}[\exp(-2\pi i y \xi)] = \mathbb{E}_{y \sim P}[e(y\xi)]$. Note that Parseval's identity states that for two pdf's $P$ and $Q$
we have $\|P - Q\|_2 = \|\hat{P} - \hat{Q}\|_2$. In our context, $P$ and $Q$ are going to be supported on a discrete
set $A$, in which case we have $\|P - Q\|_2 = \bigl(\sum_{a \in A}(P(a) - Q(a))^2\bigr)^{1/2}$. On the other hand, $\hat{P}$ and $\hat{Q}$
are Lebesgue measurable and we have $\|\hat{P} - \hat{Q}\|_2 = \bigl(\int_0^1 |\hat{P}(\xi) - \hat{Q}(\xi)|^2\, d\xi\bigr)^{1/2}$.

An equivalent way to view the Fourier transform is as a function defined on the unit circle
in the complex plane. For our purposes, we will need to analyze the corresponding polynomial
defined over the entire complex plane. Namely, we will consider the probability generating function
$\tilde{P} : \mathbb{C} \to \mathbb{C}$ of $P$ defined as $\tilde{P}(z) = \mathbb{E}_{y \sim P}[z^y]$. Note that when $|z| = 1$, this function agrees with the
Fourier transform, i.e., $\hat{P}(\xi) = \tilde{P}(e(\xi))$.
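The relationship between the Fourier transform and the probability generating function can be checked directly on a small example. In this sketch the helper names (`e`, `siirv_pmf`, `fourier`, `pgf`) are illustrative assumptions: the pmf of a $k$-SIIRV is built by convolving the pmfs of its summands, so its generating function is the product of the summands' generating functions.

```python
import cmath


def e(xi):
    """e(xi) = exp(-2 pi i xi), as in the definition above."""
    return cmath.exp(-2j * cmath.pi * xi)


def siirv_pmf(pmfs):
    """pmf of the sum of independent integer random variables, by convolution."""
    cur = [1.0]
    for pmf in pmfs:
        nxt = [0.0] * (len(cur) + len(pmf) - 1)
        for a, pa in enumerate(cur):
            for b, pb in enumerate(pmf):
                nxt[a + b] += pa * pb
        cur = nxt
    return cur


def fourier(pmf, xi):
    """Fourier transform: E[e(y xi)] for y drawn from the pmf."""
    return sum(p * e(y * xi) for y, p in enumerate(pmf))


def pgf(pmf, z):
    """Probability generating function: E[z^y], a polynomial in z."""
    return sum(p * z ** y for y, p in enumerate(pmf))
```

For a $k$-SIIRV of order $n$ the generating function is a polynomial of degree at most $n(k-1)$, which is why the root-counting arguments later in the section apply to it.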

At a high level, our proof is conceptually simple: For a $k$-SIIRV $P$, we would like to show that
the logarithm of its Fourier transform $\log \hat{P}(\xi)$ is determined up to an additive $\epsilon$ by its degree
$O(\log(1/\epsilon))$ Taylor polynomial. Assuming this holds, it is relatively straightforward to prove the
desired upper bound on the cover size. Unfortunately, such a statement cannot be true in general
for the following reason: the function $\tilde{P}(z)$ may have roots near (or on) the unit circle, in which case
the logarithm of the Fourier transform is either very big or infinite at certain points. Intuitively,
we would like to show that the magnitude of $\tilde{P}(z)$ close to a root is small. Unfortunately, this is
not necessarily true.

We circumvent this problem as follows: We partition the unit circle into $O(k)$ arcs each of
length $O(1/k)$. We perform a case analysis based on the number of roots that are close to an arc.
If there are at least $\Omega(\log(1/\epsilon))$ roots of $\tilde{P}(z)$ close to a particular arc, then we show (Lemma 3.5 (i))
that the magnitude of $\tilde{P}(z)$ within the arc is going to be negligibly small. Otherwise, we consider
the polynomial $q(z)$ obtained from $\tilde{P}(z)$ after dividing by the corresponding roots, and show that
$\log q(z)$ is determined up to an additive $\epsilon$ by its degree $O(\log(1/\epsilon))$ Taylor polynomial within the
arc (see Lemma 3.6). Using the aforementioned structural understanding, to prove the cover upper
bound, we define a “succinct” description of the Fourier Transform based on the logarithm of $q(z)$
and appropriate discretization of $O(\log(1/\epsilon))$ nearby roots.

Note that we take advantage of the fact that our distributions are supported over a domain of
size $\ell = \mathrm{poly}(k/\epsilon)$, in order to relate their total variation distance to the $L_\infty$ distance between their
Fourier transforms. In particular, we have the following simple fact:

Fact 3.4. For any pair of pdfs $P, Q$ over $[\ell]$, we have $\|P - Q\|_1 \le \sqrt{\ell+1}\, \|\hat{P} - \hat{Q}\|_\infty$.

Indeed, note that $\|P - Q\|_1 \le \sqrt{\ell+1}\, \|P - Q\|_2 = \sqrt{\ell+1}\, \|\hat{P} - \hat{Q}\|_2 \le \sqrt{\ell+1}\, \|\hat{P} - \hat{Q}\|_\infty$, where
the equality is Parseval's identity.
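Fact 3.4 can be illustrated numerically. The sketch below is an assumption-laden example rather than part of the proof: it takes two pmfs on $\{0, \ldots, \ell\}$ and evaluates the Fourier transforms only on the grid $\xi = t/(\ell+1)$, where the discrete Parseval identity holds exactly. The grid maximum is a lower bound on the true supremum, so the inequality checked here is, if anything, stronger than the stated one. The names `fourier` and `fact_3_4_sides` are hypothetical.

```python
import cmath
import math


def fourier(pmf, xi):
    """Fourier transform E[exp(-2 pi i y xi)] of a pmf on {0, ..., l}."""
    return sum(p * cmath.exp(-2j * cmath.pi * y * xi) for y, p in enumerate(pmf))


def fact_3_4_sides(p, q):
    """Return (||P - Q||_1, sqrt(l + 1) * max over the DFT grid of |P^ - Q^|)."""
    size = len(p)                       # size = l + 1 support points
    l1 = sum(abs(a - b) for a, b in zip(p, q))
    grid = [t / size for t in range(size)]
    sup = max(abs(fourier(p, xi) - fourier(q, xi)) for xi in grid)
    return l1, math.sqrt(size) * sup
```

The factor $\sqrt{\ell+1}$ is what makes the polynomial bound on the support size matter: an $L_\infty$ error of $\epsilon/\sqrt{\ell+1}$ between Fourier transforms suffices for $L_1$ error $\epsilon$.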

For the rest of this section we fix an arbitrary $P \in \mathcal{S}_{n,k}$ and analyze the polynomial $\tilde{P}(x)$. We
start with the following important lemma whose proof is deferred to Appendix D.2.

Lemma 3.5. Fix $x \in \mathbb{C}$ with $|x| = 1$. Suppose that $\rho_1, \ldots, \rho_m$ are roots of $\tilde{P}(x)$ (listed with
appropriate multiplicity) which have $|\rho_i - x| \le \frac{1}{2k}$. Then, we have the following:

(i) $|\tilde{P}(x)| \le 2^{-m}$.

(ii) For the polynomial $q(x) = \tilde{P}(x)/\prod_{i=1}^{m}(x - \rho_i)$, we have that $|q(x)| \le k^m$.

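Our main lemma for this section shows that we can $\epsilon$-approximate the Taylor series of $\ln q(x)$ by
only considering the first $O(\log(1/\epsilon))$ terms:

Lemma 3.6. Fix $w \in \mathbb{C}$ with $|w| = 1$. Suppose that $\rho_1, \ldots, \rho_m$ are all the roots of $\tilde{P}(x)$ (listed with
appropriate multiplicity) which have $|\rho_i - w| \le \frac{1}{3k}$. Let $q(x) = \tilde{P}(x)/\prod_{i=1}^{m}(x - \rho_i)$ and let the Taylor series
of $\ln(q(x))$ at $w$ be $\ln q(x) = \sum_{j=0}^{\infty} c_j (x - w)^j$. Then, we have that $|c_j| \le nk(3k)^j$, for all $j \ge 1$,
and the real part of $c_0$ is at most $m \ln k$.

Fix $0 < \epsilon < 1/(12mk)$ and an integer $\ell$ satisfying $\ell \ge \log(9nk)$. For $\rho'_j$ with $|\rho'_j - \rho_j| \le \epsilon$ for
$j \in \{1, \ldots, m\}$, and $c'_j$ with $|c'_j - c_j| \le \epsilon$ for $j \in \{0, \ldots, \ell\}$ we have: For all $x \in \mathbb{C}$ with $|x| = 1$ and
$|x - w| \le \frac{1}{6k}$,

\[ \Bigl| \tilde{P}(x) - \prod_{j=1}^{m} (x - \rho'_j)\, \exp\Bigl(\sum_{j=0}^{\ell} c'_j (x - w)^j\Bigr) \Bigr| \le O\bigl(\epsilon m k + nk2^{-\ell}\bigr) . \tag{6} \]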


Proof. We start by noting that, by the triangle inequality, Lemma 3.5 applies to all points $x \in \mathbb{C}$ with $|x| = 1$ and $|x - w| \le \frac{1}{6k}$. Observe that $c_0 = \ln[q(w)]$ and by Lemma 3.5(ii) $|q(w)| \le k^m$. This gives the claim on the real part of $c_0$.

Note that $\ln(q(x))$ can be expressed as a sum of the form

$$\ln(q(x)) = c_0 + \sum_{h=1}^{R} \ln\big(1 - (x - w)/(r_h - w)\big),$$

where $c_0 = \ln[q(w)]$, $r_h$ are the roots of $q(x)$, and $R \le n(k-1)$ is the degree of $q(x)$. By the definition of $q$, it follows that $|r_h - w| \ge \frac{1}{3k}$ for all $1 \le h \le R$.

Inserting the standard Taylor series $\ln(1 + y) = \sum_{j=1}^{\infty} (-1)^{j+1} y^j / j$ with $y = -(x - w)/(r_h - w)$ gives

$$\ln(q(x)) = c_0 - \sum_{h=1}^{R} \sum_{j=1}^{\infty} \frac{(x - w)^j}{j \cdot (r_h - w)^j}.$$

Considering the $(x - w)^j$ term above gives $c_j = -\frac{1}{j} \sum_{h=1}^{R} (r_h - w)^{-j}$. Therefore,

$$|c_j| \le R(3k)^j \le nk(3k)^j.$$

This gives the desired bound on $|c_j|$, $j \ge 1$.
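The coefficient formula and the geometric decay behind the truncation can be checked numerically. The sketch below is only an illustration under stated assumptions: the roots, the point `w`, and the helper names are hypothetical, and `q` is taken to be the monic polynomial with the listed roots, all at distance at least $1/(3k)$ from `w`.

```python
import cmath

def log_taylor_coeffs(roots, w, ell):
    """c_j = -(1/j) * sum_h (r_h - w)^(-j) for j = 1..ell."""
    return [-sum((r - w) ** (-j) for r in roots) / j for j in range(1, ell + 1)]

def truncated_log_ratio(roots, w, x, ell):
    """Degree-ell Taylor polynomial of ln q(x) - ln q(w) around w."""
    coeffs = log_taylor_coeffs(roots, w, ell)
    return sum(c * (x - w) ** j for j, c in enumerate(coeffs, start=1))

def exact_log_ratio(roots, w, x):
    """ln q(x) - ln q(w) = sum_h ln(1 - (x - w)/(r_h - w))."""
    return sum(cmath.log(1 - (x - w) / (r - w)) for r in roots)

k = 3
w = 1 + 0j
# Hypothetical roots: three at distance exactly 1/(3k) from w, two far away.
roots = [w + cmath.exp(1j * t) / (3 * k) for t in (0.3, 1.7, 2.9)]
roots += [-1.5 + 0.2j, 0.2 - 0.9j]
x = cmath.exp(0.05j)          # |x| = 1 and |x - w| < 1/(6k)
ell = 20

coeffs = log_taylor_coeffs(roots, w, ell)
err = abs(truncated_log_ratio(roots, w, x, ell) - exact_log_ratio(roots, w, x))
bound = len(roots) * 2.0 ** (-ell)   # R * 2^(-ell), the analogue of nk * 2^(-ell)
print(err, bound)
```

Since $|x - w| / |r_h - w| \le 1/2$ for every root, each additional Taylor term halves the error, which is why $\ell = O(\log(1/\epsilon))$ terms suffice.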
We now proceed to prove (6). We start by considering the difference

$$\sum_{j=0}^{\ell} c'_j (x - w)^j - \ln(q(x)),$$

for $x$ in the appropriate range. Since $|x - w| \le \frac{1}{6k} < 1/2$ and $|c'_j - c_j| \le \epsilon$, we have

$$\left| \sum_{j=0}^{\ell} c'_j (x - w)^j - \sum_{j=0}^{\ell} c_j (x - w)^j \right| \le \epsilon \sum_{j=0}^{\ell} 2^{-j} \le 2\epsilon.$$

So, we need to consider the error introduced by truncating the Taylor series after the first $\ell$ terms. We have

$$\left| \sum_{j=0}^{\ell} c_j (x - w)^j - \ln(q(x)) \right| = \left| \sum_{j > \ell} c_j (x - w)^j \right| \le \sum_{j > \ell} nk(3k)^j (6k)^{-j} = nk2^{-\ell}.$$

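Therefore, by the triangle inequality,

$$\left| \sum_{j=0}^{\ell} c'_j (x - w)^j - \ln(q(x)) \right| \le 2\epsilon + nk2^{-\ell}.$$

Thus, the multiplicative error in this approximation, i.e.,

$$\frac{1}{q(x)} \exp\Big(\sum_{j=0}^{\ell} c'_j (x - w)^j\Big) = \frac{1}{P(x)} \Big(\prod_{j=1}^{m}(x - \rho_j)\Big) \exp\Big(\sum_{j=0}^{\ell} c'_j (x - w)^j\Big),$$

is $\exp(E)$, where $|E| \le 2\epsilon + nk2^{-\ell}$. Since $|P(x)| \le 1$ and, by our assumptions on $\epsilon$ and $\ell$, $2\epsilon + nk2^{-\ell} \le 1$, we have that

$$\left| P(x) - \prod_{j=1}^{m}(x - \rho_j)\, \exp\Big(\sum_{j=0}^{\ell} c'_j (x - w)^j\Big) \right| \le e \cdot (2\epsilon + nk2^{-\ell}).$$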


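We next replace each $\rho_j$ by the corresponding $\rho'_j$ one at a time. By a simple induction, we will show that for all $0 \le h \le m$

$$\left| P(x) - \Big(\prod_{j=1}^{h}(x - \rho'_j)\Big) \Big(\prod_{j=h+1}^{m}(x - \rho_j)\Big) \exp\Big(\sum_{j=0}^{\ell} c'_j (x - w)^j\Big) \right| \le e \cdot (2\epsilon + nk2^{-\ell}) + 4hk\epsilon. \tag{7}$$

We have just shown this for $h = 0$. So, we assume (7) for some $0 \le h \le m - 1$ and seek to prove it for $h + 1$. For simplicity, we rewrite (7) as

$$\left| P(x) - (x - \rho_{h+1}) f_h(x) \right| \le e \cdot (2\epsilon + nk2^{-\ell}) + 4hk\epsilon,$$

where $f_h(x) = \big(\prod_{j=1}^{h}(x - \rho'_j)\big) \big(\prod_{j=h+2}^{m}(x - \rho_j)\big) \exp\big(\sum_{j=0}^{\ell} c'_j (x - w)^j\big)$. Note that the RHS of (7) satisfies

$$e \cdot (2\epsilon + nk2^{-\ell}) + 4hk\epsilon \le e \cdot (2\epsilon + nk2^{-\ell}) + 4mk\epsilon \le 1,$$

by our assumptions on $\epsilon$ and $\ell$. Since $|P(x)| \le 2^{-m} \le 1$, we have $|(x - \rho_{h+1}) f_h(x)| \le 2$, or $|f_h(x)| \le \frac{2}{|x - \rho_{h+1}|} \le 4k$. Now if we replace $(x - \rho_{h+1}) f_h(x)$ with $(x - \rho'_{h+1}) f_h(x)$, we introduce an error of $|(x - \rho_{h+1}) f_h(x) - (x - \rho'_{h+1}) f_h(x)| = |\rho'_{h+1} - \rho_{h+1}| \, |f_h(x)| \le \epsilon \cdot 4k$. Hence,

$$\left| P(x) - (x - \rho'_{h+1}) f_h(x) \right| \le e \cdot (2\epsilon + nk2^{-\ell}) + 4(h+1)k\epsilon.$$

But this is just (7) for $h + 1$, completing the induction.
Taking $h = m$ in (7) gives:

$$\left| P(x) - \prod_{j=1}^{m}(x - \rho'_j)\, \exp\Big(\sum_{j=0}^{\ell} c'_j (x - w)^j\Big) \right| \le e \cdot (2\epsilon + nk2^{-\ell}) + 4mk\epsilon,$$

as required. □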

We are now prepared to prove Proposition 3.3.

Proof of Proposition 3.3. By replacing $\epsilon$ by a power of itself, we may assume that $\epsilon < k^{-1}$ and that $n < \epsilon^{-1}$. We may additionally assume that $\epsilon$ is sufficiently small.

It suffices to find a subset $\mathcal{T}$ of $\mathcal{S}_{n,k}$ of appropriate size so that for any $P \in \mathcal{S}_{n,k}$ there is some $Q \in \mathcal{T}$ so that $|P(z) - Q(z)| \le \epsilon^2$ for all $|z| = 1$, as Fact 3.4 would then imply that $d_{TV}(P, Q) \le \epsilon$.

We begin by defining some parameters. Let $m$ be an integer larger than $3\log(1/\epsilon)$. Let $\ell$ be an integer larger than $\log(nk/\epsilon^3)$ and $\delta > 0$ a real number smaller than $\epsilon^3/(mk + \ell)$. Additionally, we divide the unit circle of $\mathbb{C}$ into $O(k)$ arcs, each of length at most $1/(3k)$.

To each $P \in \mathcal{S}_{n,k}$ we associate the following data:

• For each arc $I$ in our partition with midpoint $w_I$, define $q(z)$ as in Lemma 3.6. Then we define $P_I$ as follows:

– If $P(z)$ has at least $m$ roots within distance $1/(3k)$ of $w_I$ or if $|q(w_I)| \le \epsilon^3 \exp(-nk)$, we let $P_I = \text{Small}$.

– Otherwise, we let $P_I$ consist of the following data:

* Roundings of the roots of $P(z)$ that are within $1/(3k)$ of $w_I$ to the nearest complex numbers whose real and imaginary parts are multiples of $\delta/2$.

* Roundings of the first $\ell$ Taylor coefficients of $\log(q)$ about $w_I$ to the nearest complex numbers whose real and imaginary parts are multiples of $\delta/2$.

We then let $D(P)$ be the sequence $\{P_I\}_{I \text{ an arc in the partition}}$. For each value $V$ that can be obtained as $D(P)$ for some $P \in \mathcal{S}_{n,k}$, we pick one such $P$, called $Q_V$. We define our cover $\mathcal{T}$ to be the set of all such $Q_V$. In order to show that this is an appropriate cover, we need to show two claims:

1. The number of possible values of $D(P)$ is at most $(1/\epsilon)^{O(k\log(1/\epsilon))}$. This implies that $|\mathcal{T}|$ is appropriately small.

2. If $P, Q \in \mathcal{S}_{n,k}$ have $D(P) = D(Q)$, then $d_{TV}(P, Q) \le \epsilon$. This will imply that $\mathcal{T}$ is a cover, since given any $P \in \mathcal{S}_{n,k}$, we may take $Q = Q_{D(P)} \in \mathcal{T}$.
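The rounding step in the data above is elementary, and a small sketch makes the resulting error explicit. The function name and the sample values are hypothetical; the point is only that rounding both coordinates to multiples of $\delta/2$ moves a complex number by at most $\sqrt{2}\,\delta/4 < \delta$, which is the accuracy Lemma 3.6 asks of the rounded roots and coefficients (with $\delta$ in the role of $\epsilon$).

```python
import math

def round_to_grid(z, step):
    """Round the real and imaginary parts of z to the nearest multiple of step."""
    return complex(round(z.real / step) * step, round(z.imag / step) * step)

delta = 1e-3
z = complex(0.123456, -0.654321)          # a hypothetical root or coefficient
z_rounded = round_to_grid(z, delta / 2)
error = abs(z - z_rounded)
print(z_rounded, error)
```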

The first claim is relatively straightforward. For each of the $O(k)$ arcs $I$, we have that $P_I$ is either Small or a sequence of $O(\log(1/\epsilon))$ complex numbers, each of which can take only $\mathrm{poly}(1/\delta)$ many possible values. Thus, the number of possible values for $P_I$ is at most $\mathrm{poly}(1/\delta)^{O(\log(1/\epsilon))} = (1/\epsilon)^{O(\log(1/\epsilon))}$. The number of possible values for $D(P)$ is at most this raised to the number of arcs, which is $(1/\epsilon)^{O(k\log(1/\epsilon))}$.

The second claim is slightly more involved. We note that it is sufficient to show that if $D(P) = D(Q)$, then $|P(z) - Q(z)| \le \epsilon^2$ for all unit norm $z$. In particular, we show the stronger claim that for any of our arcs $I$, if $P_I = Q_I$, then $|P(z) - Q(z)| = O(\epsilon^3)$ for all $z \in I$.

If $P_I = Q_I = \text{Small}$, we claim that $|P(z)|, |Q(z)| = O(\epsilon^3)$ for all $z \in I$. It suffices to show this merely for $P$. On the one hand, if $P(z)$ has at least $m$ roots near $w_I$, this follows from the first part of Lemma 3.5. On the other hand, if $|q(w_I)| \le \epsilon^3 \exp(-nk)$, then for any other $z \in I$ we have that

$$q(z) = q(w_I) \exp\Big(\sum_{j=1}^{\infty} c_j (z - w_I)^j\Big),$$

where by Lemma 3.6 $|c_j| \le nk(3k)^j$. Therefore, for $z \in I$, since $|z - w_I| \le 1/(6k)$, we have by Lemma 3.5 that

$$|P(z)| \le |q(z)| \le |q(w_I)| \exp(nk) \le \epsilon^3.$$

If $P_I = Q_I \ne \text{Small}$, we note by Lemma 3.6 that for $z \in I$ both of $P(z)$ and $Q(z)$ are within $O(mk\delta + \ell\delta + nk2^{-\ell}) = O(\epsilon^3)$ of $\prod_{j}(z - \rho'_j) \exp\big(\sum_{j=0}^{\ell} c'_j (z - w_I)^j\big)$, where the $\rho'_j$ are the roundings of nearby roots and the $c'_j$ are the roundings of the Taylor coefficients given by the data $P_I = Q_I$.

Thus, again in this case, $|P(z) - Q(z)| \le O(\epsilon^3)$ for all $z \in I$.

This completes the proof of Proposition 3.3. □

3.3 Efficient Cover Construction In this section, we give an algorithm to construct a near-minimum size cover in output polynomial time:

Theorem 3.7. Let $n, k$ be positive integers and $\epsilon > 0$. There exists an algorithm that runs in time $n(k/\epsilon)^{O(k\log(1/\epsilon))}$ and returns a proper $\epsilon$-cover for $\mathcal{S}_{n,k}$, i.e., a cover consisting of $n(k/\epsilon)^{O(k\log(1/\epsilon))}$ $k$-SIIRVs, each given as an explicit sum of $k$-IRVs.

Our algorithm builds on the existential upper bound established in the previous subsections. We first construct an $\epsilon$-cover for $k$-SIIRVs in Case 2 of Theorem 3.1, i.e., $k$-SIIRVs whose variance is more than a sufficiently large polynomial in $k/\epsilon$. By Theorem 3.1, each such $k$-SIIRV is $\epsilon$-close to a random variable of the form $cZ + Y$, where $1 \le c \le k - 1$ is an integer, $Z$ is a discrete Gaussian and $Y$ is a $c$-IRV. In Section 3.1, we exploited this structural fact to construct a non-proper cover for $k$-SIIRVs in this case. We remark that this non-proper cover may contain "spurious" points, i.e., points not close to a large variance $k$-SIIRV. Efficiently constructing a proper cover without spurious points for the high variance case requires careful arguments and is deferred to Appendix D.3.

We now focus our attention on Case 1. By Lemma 3.2, we have that all such $k$-SIIRVs can be approximated by a constant plus a sum of $\mathrm{poly}(k/\epsilon)$ $k$-IRVs. Since there are only $nk$ possibilities for this constant, and all such possibilities are easily obtainable, it suffices to find an explicit $\epsilon$-cover for $\mathcal{S}_{n,k}$ when $n = \mathrm{poly}(k/\epsilon)$.

A simple but useful observation is that we can round each coordinate probability for each of our $k$-IRVs to a multiple of $\epsilon/(nk)$ and introduce an error of $O(\epsilon)$ in total variation distance. Therefore, it suffices to find a cover of $\mathcal{S}'_{n,k}$, the set of sums of $n = \mathrm{poly}(k/\epsilon)$ independent $k$-IRVs, where each of their coordinate probabilities is a multiple of $1/N$ for some integer $N = \mathrm{poly}(k/\epsilon)$. We will henceforth call such a $k$-IRV an $N$-discrete $k$-IRV.
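One way to carry out the rounding to an $N$-discrete $k$-IRV is sketched below. This is an illustrative largest-remainder scheme, not necessarily the rounding intended in the text; the function names and the example pmf are assumptions. Each coordinate moves by less than $1/N$, so a single $k$-IRV moves by at most $k/(2N)$ in total variation.

```python
from fractions import Fraction

def round_to_n_discrete(p, N):
    """Round a pmf p (Fractions summing to 1) to multiples of 1/N, keeping total mass 1."""
    scaled = [x * N for x in p]
    units = [s.numerator // s.denominator for s in scaled]   # floor of each scaled mass
    leftover = N - sum(units)                                # units still to hand out
    # Give the remaining units to the coordinates with the largest fractional parts.
    order = sorted(range(len(p)), key=lambda i: scaled[i] - units[i], reverse=True)
    for i in order[:leftover]:
        units[i] += 1
    return [Fraction(u, N) for u in units]

def total_variation(P, Q):
    """Half the L1 distance between two pmfs of equal length."""
    return sum(abs(a - b) for a, b in zip(P, Q)) / 2

p = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]   # a 3-IRV
N = 10
p_rounded = round_to_n_discrete(p, N)
tv = total_variation(p, p_rounded)
print(p_rounded, tv)
```

Summing the per-variable error over $n$ independent $k$-IRVs gives at most $nk/(2N)$, so taking $N$ of order $nk/\epsilon$ keeps the total error $O(\epsilon)$.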

Our main workhorse here will once again be Lemma 3.6. The cover we construct will be much the same as in Proposition 3.3, but we will now explicitly produce SIIRVs that obtain every possible value of $D$. Fortunately, the Taylor series of the log of the Fourier transform is additive in the composite $k$-IRVs, and so there exists an appropriate dynamic program to solve this problem.

Let $\delta > 0$ be given by a sufficiently small polynomial in $\epsilon/k$, and let $m$ be an integer at least a sufficiently large multiple of $\log(1/\epsilon)$. We divide the unit circle into arcs $I$ with midpoints $w_I$ as described in the proof of Proposition 3.3. For any $N$-discrete $k$-IRV $P$, we associate the following data. For each interval $I$, let $\rho_{1,I}, \ldots, \rho_{r_I,I}$ be the roots of $P$ that are within distance $1/(3k)$ of $w_I$, and let $q(z) = P(z)/\prod_{j=1}^{r_I}(z - \rho_{j,I})$. For $1 \le j \le r_I$, let $\rho'_{j,I}$ be a rounding of $\rho_{j,I}$ with $\rho'_{j,I} = (a + bi)\delta$ for some $a, b \in \mathbb{Z}$ and $|\rho'_{j,I} - \rho_{j,I}| \le \delta$. For $0 \le j \le m$, let $c'_{j,I}$ be a rounding of $c_{j,I}$ with $c'_{j,I} = (a + bi)\delta$ for some $a, b \in \mathbb{Z}$ and $|c'_{j,I} - c_{j,I}| \le \delta$, where the $c_{j,I}$ are the coefficients of the first $m + 1$ terms of the Taylor series $\ln q(z) = \sum_{j=0}^{\infty} c_{j,I}(z - w_I)^j$. Let $P_I$ be the data consisting of the list $(\rho'_{1,I}, \ldots, \rho'_{r_I,I})$ and the vector $(c'_{0,I}, c'_{1,I}, \ldots, c'_{m,I})$. We let $D(P)$ be the sequence of $P_I$ over all intervals $I$.

Given a sequence $P_1, P_2, \ldots, P_h$ of $k$-IRVs, we let $D(P_1, \ldots, P_h)$ be given by the following data for each $I$:

• The first $m$ elements of the concatenation of the lists of approximate roots of the $P_i(\cdot)$, $1 \le i \le h$, near $w_I$.

• The list of elements $\sum_{i=1}^{h} c'_{j,I}(P_i)$ for $0 \le j \le m$, with the exception that the $j = 0$ term is replaced by $-\infty$ if for any $h' \le h$ we have that the real part of $\sum_{i=1}^{h'} c'_{0,I}(P_i)$ is less than $-nk - m - m\ln k$.

Our algorithm will follow from three important claims:

Claim 3.8. We have the following:

(i) $D(P_1, \ldots, P_h)$ can be computed in $\mathrm{poly}(k/\epsilon)$ time from $D(P_1, \ldots, P_{h-1})$ and $D(P_h)$.

(ii) There are only $\mathrm{poly}(k/\epsilon)^{O(k\log(1/\epsilon))}$ possible values for $D(P_1, \ldots, P_h)$ for any $h \le n$.

(iii) If $D(P_1, \ldots, P_n) = D(Q_1, \ldots, Q_n)$ and $P, Q$ are the distributions of $\sum_{i=1}^{n} X_i$ and $\sum_{i=1}^{n} Y_i$ for $X_i \sim P_i$ and $Y_i \sim Q_i$, then $d_{TV}(P, Q) \le \epsilon$.

Proof. The first statement follows from the fact that the lists of roots in $D(P_1, \ldots, P_h)$ are obtained by concatenating those in $D(P_1, \ldots, P_{h-1})$ with those in $D(P_h)$, and truncating if necessary. Moreover, $\sum_{i=1}^{h} c'_{j,I}(P_i)$ is obtained by adding $c'_{j,I}(P_h)$ to $\sum_{i=1}^{h-1} c'_{j,I}(P_i)$ (with the $j = 0$ term remaining $-\infty$ if it was $-\infty$ in $D(P_1, \ldots, P_{h-1})$).

For the second statement, note that for each of the $O(k)$ intervals, we store $O(\log(1/\epsilon))$ complex numbers whose real and imaginary parts are each multiples of $\delta$. As each of these numbers (with the exception of a $-\infty$ term) has size at most $\mathrm{poly}(k/\epsilon)$ and $\delta = \mathrm{poly}(\epsilon/k)$, there are only $\mathrm{poly}(k/\epsilon)^{O(k\log(1/\epsilon))}$ many possible values for $D(P_1, \ldots, P_h)$.

The third statement is true for essentially the same reasons as in the proof of Proposition 3.3. Once again, we simply need to show that for each interval $I$ it holds that $|P(z) - Q(z)| \le (\epsilon/k)^c$ for all $z \in I$ and $c$ a sufficiently large constant. Note that the listed roots are simply $\delta$-approximations of the (first $m$) roots of $P$ and $Q$ within distance $1/(3k)$ of $w_I$, and the $\sum_{i=1}^{n} c'_{j,I}(P_i)$ are within distance $n\delta$ of the coefficients of the Taylor expansion of the logarithm of $q(z)$ about $w_I$. If we have $m$ nearby roots, both $P$ and $Q$ are small for all $z$ in this range. Otherwise, unless there is a $-\infty$ in $D(P) = D(Q)$, they are close by Lemma 3.6. If we do have a $-\infty$, then

$$\Re\Big(\sum_{i=1}^{h'} c'_{0,I}(P_i)\Big) < -nk - m - m\ln k$$

for some $h' \le n$. Since the later $c_{0,I}(P_i)$ and $c'_{0,I}(P_i)$ have $\Re\, c_{0,I}(P_i) \le m_i \ln k$ and $\Re\, c'_{0,I}(P_i) \le m_i \ln k$ by Lemma 3.6, this means that $|q(w_I)| \le e^{-m} e^{-nk}$, and as in Proposition 3.3 this implies that both $P$ and $Q$ are sufficiently small. □

We can now present the algorithm for producing our cover. The basic idea is to use a dynamic program to come up with one representative collection of $P_1, \ldots, P_h$ to obtain each achievable value of $D$. The algorithm is as follows:

Algorithm Cover-SIIRV

Input: $k$, $\epsilon > 0$ and $n, N = \mathrm{poly}(k/\epsilon)$.

1. Define $\delta$ and $m$ as above.

2. Let $L_0 = \{(D(\emptyset), \emptyset)\}$.

3. For $h = 1$ to $n$

4. Let $L_h$ be the set of terms of the form $(D(P_1, \ldots, P_h), (P_1, \ldots, P_h))$ where $(D(P_1, \ldots, P_{h-1}), (P_1, \ldots, P_{h-1})) \in L_{h-1}$ and $P_h$ is an $N$-discrete $k$-IRV.

5. Use a hash table to remove from $L_h$ all but one term with each possible value of $D(P_1, \ldots, P_h)$.

6. End for

7. Return the list of distributions $\sum_{i=1}^{n} X_i$ with $X_i \sim P_i$ for each $(D(P_1, \ldots, P_n), (P_1, \ldots, P_n)) \in L_n$.

To prove that this produces a cover, we claim by induction on $h$ that $L_h$ contains an element that achieves each possible value of $D(P_1, \ldots, P_h)$. This is clearly true for $h = 0$. Given that it holds for $h - 1$, Claim 3.8(i) implies that the non-deduped version of $L_h$ also satisfies this property, and deduping clearly does not destroy it. Therefore $L_n$ contains (exactly one) element for each possible value of $D(P_1, \ldots, P_n)$. Therefore, by Claims 3.8(ii) and (iii), the algorithm will return a cover of the appropriate size. For the runtime, we note that the initial size of $L_h$ before deduping is the product of the size of $L_{h-1}$ and the number of $N$-discrete $k$-IRVs, which by Claim 3.8(ii) is $\mathrm{poly}(k/\epsilon)^{k\log(1/\epsilon)}$. Each of these elements is generated in $\mathrm{poly}(k/\epsilon)$ time, and the deduping process takes only polynomial time per element. Therefore, the final runtime is $\mathrm{poly}(k/\epsilon)^{k\log(1/\epsilon)}$. This completes the proof of Theorem 3.7.
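The control flow of Cover-SIIRV (extend every stored representative by every candidate, then keep one representative per signature) can be sketched generically. The code below is a toy illustration under loud assumptions: `dedup_dp`, `toy_combine`, and the integer "items" are hypothetical stand-ins, and the toy signature (a capped running sum) only mimics the property from Claim 3.8(i) that the signature of a longer sequence is computable from the signature of its prefix and the new item. It does not compute the actual data $D$.

```python
from itertools import product

def dedup_dp(items, n, empty_signature, combine):
    """Return {signature: representative tuple of n items}, one per achievable signature."""
    level = {empty_signature: ()}
    for _ in range(n):
        next_level = {}
        for signature, representative in level.items():
            for item in items:
                new_signature = combine(signature, item)
                # The dict plays the role of the hash table: first representative wins.
                next_level.setdefault(new_signature, representative + (item,))
        level = next_level
    return level

def toy_combine(signature, item):
    """Toy signature: running sum capped at 5 (the cap mimics truncation of the root list)."""
    return min(signature + item, 5)

def fold(sequence):
    """Signature of a whole sequence, computed from scratch."""
    signature = 0
    for item in sequence:
        signature = toy_combine(signature, item)
    return signature

items = [0, 1, 2]
n = 4
table = dedup_dp(items, n, 0, toy_combine)
brute_force = {fold(seq) for seq in product(items, repeat=n)}
print(sorted(table))
```

The table never holds more entries than there are achievable signatures, which is the source of the running-time bound: each level costs (number of signatures) times (number of candidate items) combine steps.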

4 Cover Size Lower Bound

In this section we prove our lower bound on the cover size of $k$-SIIRVs. In Section 4.1, we show the desired lower bound for the case of 2-SIIRVs. In Section 4.2, we generalize this construction to general $k$-SIIRVs.

4.1 Cover Size Lower Bound for 2-SIIRVs We start by providing an explicit lower bound on the cover size of 2-SIIRVs. In particular, we show the following:

Theorem 4.1. For all $0 < \epsilon \le e^{-42}$ and $n \in \mathbb{Z}$ such that $7 \le n \le \frac{1}{12}\ln(1/\epsilon)$, there is an $\epsilon$-packing of $\mathcal{S}_{n,2}$ under $d_{TV}$ with cardinality $(1/\epsilon)^{\Omega(n)}$.

We begin with the following useful lemma: 

Lemma 4.2. Let $P$ and $Q$ be 2-SIIRVs given by parameters $p_i$ and $q_i$ for $1 \le i \le n$, for some $n \ge 7$. Suppose that for all $i$, $1 \le i \le n$, it holds that $|p_i - i/(n+1)| \le 1/(4(n+1))$ and $|q_i - i/(n+1)| \le 1/(4(n+1))$. Then,

$$d_{TV}(P, Q) \ge \max_i |p_i - q_i| \cdot e^{-3n}.$$

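Proof. Fix an index $i$ attaining the maximum and let $\epsilon = |p_i - q_i| e^{-3n}$. For a distribution $P$ supported on $[n]$, define $r_P(p)$ to be the polynomial

$$r_P(p) = \mathbb{E}_{X \sim P}\big[(p - 1)^X p^{n - X}\big] = \sum_{i=0}^{n} P(i)(p - 1)^i p^{n-i}.$$

For a PBD $P \in \mathcal{S}_{n,2}$ and $X \sim P$ with $X = \sum_{i=1}^{n} X_i$ for $X_i \sim \mathrm{Ber}(p_i)$, we have that

$$r_P(p) = \mathbb{E}\big[(p - 1)^X p^{n - X}\big] = \mathbb{E}\Big[(p - 1)^{\sum_{i=1}^{n} X_i} \cdot p^{\sum_{i=1}^{n}(1 - X_i)}\Big] = \mathbb{E}\Big[\prod_{i=1}^{n}(p - 1)^{X_i} p^{1 - X_i}\Big] = \prod_{i=1}^{n} \mathbb{E}\big[(p - 1)^{X_i} p^{1 - X_i}\big]$$

$$= \prod_{i=1}^{n} \big(p_i(p - 1) + (1 - p_i)p\big) = \prod_{i=1}^{n}(p - p_i).$$

Hence, the roots of the polynomial $r_P$ are exactly the parameters $p_i$ of the 2-SIIRV $P \in \mathcal{S}_{n,2}$. We have the following simple claim:

Claim 4.3. Let $P, Q \in \mathcal{S}_{n,2}$ such that $d_{TV}(P, Q) \le \epsilon$. Then for any $p \in [0, 1]$, we have that

$$|r_P(p) - r_Q(p)| \le 2\epsilon.$$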
Proof. We have the following sequence of (in)equalities:

$$|r_P(p) - r_Q(p)| = \Big| \sum_{i=0}^{n} (P(i) - Q(i))(p - 1)^i p^{n-i} \Big|$$

$$\le \sum_{i=0}^{n} |P(i) - Q(i)| \cdot |(p - 1)^i p^{n-i}|$$

$$\le \sum_{i=0}^{n} |P(i) - Q(i)| = 2 d_{TV}(P, Q) \le 2\epsilon,$$

where the second line is the triangle inequality and the third line uses the fact that $|(p - 1)^i p^{n-i}| \le 1$ for all $i \in [n]$ and $p \in [0, 1]$. □
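Both the product formula for $r_P$ and Claim 4.3 are easy to confirm in exact arithmetic. The following sketch uses hypothetical helper names and two illustrative parameter vectors; it computes the pmf of a PBD by repeated convolution and then evaluates $r_P$ directly from its definition.

```python
from fractions import Fraction

def pbd_pmf(params):
    """pmf of a sum of independent Bernoulli(p_i) variables, by repeated convolution."""
    pmf = [Fraction(1)]
    for p in params:
        nxt = [Fraction(0)] * (len(pmf) + 1)
        for i, mass in enumerate(pmf):
            nxt[i] += mass * (1 - p)
            nxt[i + 1] += mass * p
        pmf = nxt
    return pmf

def r_poly(pmf, p):
    """r_P(p) = sum_i P(i) (p - 1)^i p^(n - i), where n = len(pmf) - 1."""
    n = len(pmf) - 1
    return sum(mass * (p - 1) ** i * p ** (n - i) for i, mass in enumerate(pmf))

def total_variation(P, Q):
    """Half the L1 distance between two pmfs of equal length."""
    return sum(abs(a - b) for a, b in zip(P, Q)) / 2

ps = [Fraction(1, 4), Fraction(2, 4), Fraction(3, 4)]
qs = [Fraction(1, 4), Fraction(5, 9), Fraction(3, 4)]
P, Q = pbd_pmf(ps), pbd_pmf(qs)

x = Fraction(2, 5)                 # an arbitrary evaluation point in [0, 1]
product_form = Fraction(1)
for p in ps:
    product_form *= (x - p)

gap = abs(r_poly(P, x) - r_poly(Q, x))
print(r_poly(P, x), product_form, gap, 2 * total_variation(P, Q))
```

Evaluating at a parameter of $P$ makes $r_P$ vanish, which is exactly the choice made in the rest of the proof of Lemma 4.2.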

Hence, to prove the lemma, it suffices to show that for some $p \in [0, 1]$ we have

$$|r_P(p) - r_Q(p)| \ge 2\epsilon.$$

In particular, we show this for $p = p_i$. Noting that $r_P(p_i) = 0$, it suffices to show that $|r_Q(p_i)| \ge 2\epsilon$. We now proceed to prove this fact. If $j \ne i$ we have that

$$|p_i - q_j| \ge \Big| \frac{i}{n+1} - \frac{j}{n+1} \Big| - \frac{1}{2(n+1)} \ge \frac{|i - j|}{2(n+1)}.$$

Therefore, we have that

$$|r_Q(p_i)| = |p_i - q_i| \cdot \prod_{j \ne i} |p_i - q_j| \ge |p_i - q_i| \cdot \prod_{j \ne i} \frac{|i - j|}{2(n+1)}.$$

We note that

$$\prod_{j \ne i} |i - j| = (i - 1)!\,(n - i)! = \frac{(n-1)!}{\binom{n-1}{i-1}} \ge \frac{(n/e)^n}{n \cdot 2^{n-1}}, \tag{8}$$

where we use the elementary inequalities $n! \ge (n/e)^n$ and $\binom{n-1}{i-1} \le 2^{n-1}$. Applying this to the above, we find that

$$|r_Q(p_i)| \ge \frac{|p_i - q_i|}{e \cdot (n+1)(4e)^n} \ge \frac{2|p_i - q_i|}{e^{3n}} = 2\epsilon,$$

where the second inequality uses $n \ge 7$. □


Proof of Theorem \4-l\ Given e > 0 and n € Z satisfying the condition of the theorem, we define 
an explicit e-packing for S U: 2 as follows: Let s = For a vector a = (ai,..., a n ) € [s] n , let 


$$p_i^a = \frac{i}{n+1} + \frac{a_i \sqrt{\epsilon}}{4n}, \qquad i \in \{1, \ldots, n\},$$

be the parameters of a 2-SIIRV $P_a \in \mathcal{S}_{n,2}$. We claim that the set of 2-SIIRVs $\{P_a\}_{a \in [s]^n}$ satisfies the conditions of the theorem, i.e., for all $a, b \in [s]^n$, $a \ne b$ implies $d_{TV}(P_a, P_b) \ge \epsilon$.

In particular, if $a \ne b$, then there must be some $i$ so that $a_i \ne b_i$. Then, by Lemma 4.2, we have that

$$d_{TV}(P_a, P_b) \ge |p_i^a - p_i^b|\, e^{-3n} \ge \frac{\sqrt{\epsilon}}{4n}\, e^{-3n} \ge \frac{\epsilon^{3/4}}{4n} \ge \epsilon.$$

□


As a simple corollary we obtain the desired lower bound:

Corollary 4.4. For all $0 < \epsilon \le 1$ and $n = \Omega(\log(1/\epsilon))$, any $\epsilon$-cover of $\mathcal{S}_{n,2}$ under $d_{TV}$ must be of size $n \cdot (1/\epsilon)^{\Omega(\log(1/\epsilon))}$.

Proof. We will assume without loss of generality that $\epsilon$ is smaller than an appropriately small positive constant. First note that if there exists a $3\epsilon$-packing for $\mathcal{S}_{n,2}$ of cardinality $M$, then any $\epsilon$-cover for $\mathcal{S}_{n,2}$ must be of cardinality at least $M$. Indeed, for every $Q_i$, $i = 1, \ldots, M$, in the $3\epsilon$-packing, consider the (non-empty) set $N_\epsilon(Q_i)$ of points $P$ in the $\epsilon$-cover with $d_{TV}(Q_i, P) \le \epsilon$. If $P \in N_\epsilon(Q_i)$ and $j \ne i$, we have $d_{TV}(P, Q_j) \ge d_{TV}(Q_j, Q_i) - d_{TV}(Q_i, P) \ge 2\epsilon$. That is, the sets $N_\epsilon(Q_i)$ are each non-empty and mutually disjoint, which implies that the size of any $\epsilon$-cover is at least $M$.
By Theorem 4.1, for any $0 < \epsilon \le e^{-12}/3$, if we fix $n_0 = \lfloor \frac{1}{12}\ln(1/(3\epsilon)) \rfloor$, there is a $3\epsilon$-packing for $\mathcal{S}_{n_0,2}$ of size $(1/\epsilon)^{\Omega(\log(1/\epsilon))}$. From the argument of the previous paragraph, any $\epsilon$-cover for $\mathcal{S}_{n_0,2}$ is of size $(1/\epsilon)^{\Omega(\log(1/\epsilon))}$.

To prove the desired lower bound of $n \cdot (1/\epsilon)^{\Omega(\log(1/\epsilon))}$, we construct appropriate "shifts" of the set $\mathcal{S}_{n_0,2}$ as follows: Consider the set $\mathcal{S}_{n,2}$ where $n \ge r(n_0 + 1)$ for some $r \in \mathbb{Z}_+$. For $0 \le i < r$, let $\mathcal{S}^i_{n,2}$ be the subset of $\mathcal{S}_{n,2}$ where $i(n_0 + 1)$ of the parameters $p_j$ are equal to 1, and at most $n_0$ other $p_j$'s are non-zero. Note that for $i \ne j$ any elements of $\mathcal{S}^i_{n,2}$ and $\mathcal{S}^j_{n,2}$ have disjoint supports. Therefore, any $\epsilon$-cover of $\mathcal{S}_{n,2}$ must contain disjoint $\epsilon$-covers for $\mathcal{S}^i_{n,2}$ for each $i$. Note also that $\mathcal{S}^i_{n,2}$ is isomorphic to $\mathcal{S}_{n_0,2}$ for each $i$, and thus has minimal $\epsilon$-cover size at least $(1/\epsilon)^{\Omega(\log(1/\epsilon))}$. Therefore, any $\epsilon$-cover of $\mathcal{S}_{n,2}$ must have size at least $\lfloor n/n_0 \rfloor \cdot (1/\epsilon)^{\Omega(\log(1/\epsilon))} = n(1/\epsilon)^{\Omega(\log(1/\epsilon))}$.

□


4.2 Cover Size Lower Bound for $k$-SIIRVs In this section, we prove our cover lower bound for $k$-SIIRVs:

Theorem 4.5. For $0 < \epsilon \le e^{-12}(2k)^{-9}$ and $n \le \lfloor \frac{1}{12}\log(1/\epsilon) \rfloor$, there is an $\epsilon$-packing of $\mathcal{S}_{n,k}$ under $d_{TV}$ with cardinality $(1/\epsilon)^{\Omega(nk)}$.

Proof. We consider $k$-SIIRVs close to the $(k-1)$ multiple of the 2-SIIRV $P_0$ with parameters $p_i = \frac{i}{n+1}$ that we used for the explicit lower bound in Section 4.1. Let $m \in \mathbb{Z}_+$ and $0 < \delta < 1$ be parameters that will be fixed later. Given an $a \in [m]^{n(k-2)}$, which we will index by $a_{ij}$, for $i \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, k-2\}$, we define a $k$-SIIRV $P_a$ as follows. For each $i$, we take a $k$-IRV $Y_i$ with pdf defined as follows:

$$\Pr[Y_i = 0] = (1 - p_i)\Big(1 - \delta \sum_{j} a_{ij}\Big),$$

$$\Pr[Y_i = j] = \delta \cdot a_{ij}, \qquad 1 \le j \le k - 2,$$

$$\Pr[Y_i = k - 1] = p_i \Big(1 - \delta \sum_{j} a_{ij}\Big).$$

For convenience, we will denote $\gamma_{a,i} = \big(1 - \delta \cdot \sum_j a_{ij}\big)$. We claim that the set of distributions $P_a$, $a \in [m]^{n(k-2)}$, is an $\epsilon$-packing. To prove this statement we proceed similarly to the proof of Theorem 4.1. For a distribution $P$, we will consider the expectations

$$r_{P,ij} = \sum_{l=0}^{n} p_i^{n-l}(p_i - 1)^l P(l(k-1) + j)$$

for $i \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, k-2\}$. Similarly to Claim 4.3, we have the following:

Claim 4.6. Let $P, Q \in \mathcal{S}_{n,k}$ such that $d_{TV}(P, Q) \le \epsilon$. Then for any $i \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, k-2\}$, we have that

$$|r_{P,ij} - r_{Q,ij}| \le 2\epsilon.$$
Proof. We have the following sequence of (in)equalities:

$$|r_{P,ij} - r_{Q,ij}| = \Big| \sum_{l=0}^{n} \big(P(l(k-1) + j) - Q(l(k-1) + j)\big) \cdot p_i^{n-l}(p_i - 1)^l \Big|$$

$$\le \sum_{l=0}^{n} \big|P(l(k-1) + j) - Q(l(k-1) + j)\big| \cdot \big|p_i^{n-l}(p_i - 1)^l\big|$$

$$\le \sum_{l=0}^{n} \big|P(l(k-1) + j) - Q(l(k-1) + j)\big| \le 2 d_{TV}(P, Q) \le 2\epsilon,$$

where the second line is the triangle inequality and the third line uses the fact that $|p_i^{n-l}(p_i - 1)^l| \le 1$ for all $l \in [n]$ and $i \in \{1, \ldots, n\}$. □

By the above claim, to complete the proof, it suffices to show that $|r_{P_a,ij} - r_{P_b,ij}| \ge 2\epsilon$ whenever $a_{ij} \ne b_{ij}$. To prove this statement, we exploit the fact that these $k$-SIIRVs are close to a multiple of $P_0$, by ignoring terms in the expectations that are $O(\delta^2)$.
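The construction of $P_a$ and the statistics $r_{P,ij}$ can be instantiated exactly for small parameters. The sketch below is illustrative only: the helper names, the choice $n = 3$, $k = 4$, $\delta = 1/100$, and the two index vectors `a` and `b` are assumptions, not values from the proof. It builds each $Y_i$ from the pdf above with $p_i = i/(n+1)$, convolves to obtain $P_a$, and checks the inequality of Claim 4.6.

```python
from fractions import Fraction

def irv_pmf(p_i, a_row, delta):
    """pmf of Y_i on {0, ..., k-1}; a_row holds (a_{i1}, ..., a_{i,k-2})."""
    gamma = 1 - delta * sum(a_row)
    return [(1 - p_i) * gamma] + [delta * a for a in a_row] + [p_i * gamma]

def convolve(P, Q):
    """pmf of the sum of two independent integer random variables."""
    out = [Fraction(0)] * (len(P) + len(Q) - 1)
    for i, x in enumerate(P):
        for j, y in enumerate(Q):
            out[i + j] += x * y
    return out

def siirv_pmf(a, delta):
    """pmf of Y_1 + ... + Y_n, with p_i = i/(n+1) and one row of `a` per Y_i."""
    n = len(a)
    pmf = [Fraction(1)]
    for i, a_row in enumerate(a, start=1):
        pmf = convolve(pmf, irv_pmf(Fraction(i, n + 1), a_row, delta))
    return pmf

def r_stat(pmf, n, k, i, j):
    """r_{P,ij} = sum_l p_i^(n-l) (p_i - 1)^l P(l(k-1) + j)."""
    p_i = Fraction(i, n + 1)
    total = Fraction(0)
    for l in range(n + 1):
        idx = l * (k - 1) + j
        if idx < len(pmf):          # points beyond the support carry zero mass
            total += p_i ** (n - l) * (p_i - 1) ** l * pmf[idx]
    return total

def total_variation(P, Q):
    return sum(abs(x - y) for x, y in zip(P, Q)) / 2

n, k, delta = 3, 4, Fraction(1, 100)
a = [(1, 0), (2, 1), (0, 2)]
b = [(1, 0), (1, 1), (0, 2)]       # differs from a only in the entry a_{21}
P_a, P_b = siirv_pmf(a, delta), siirv_pmf(b, delta)
tv = total_variation(P_a, P_b)
gaps = {(i, j): abs(r_stat(P_a, n, k, i, j) - r_stat(P_b, n, k, i, j))
        for i in range(1, n + 1) for j in range(1, k - 1)}
print(tv, gaps[(2, 1)])
```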

Let $Y = \sum_{i=1}^{n} Y_i$ with $Y \sim P_a$ for a given $a \in [m]^{n(k-2)}$. We define several events depending on which coordinates $Y_i$ are equal to 0 or $k - 1$, and consider their contribution to the expectation $r_{P_a,ij}$ separately.

Firstly, let $A_{\ge 2}$ be the event that more than one $Y_i$ is not 0 or $k - 1$. The probability that any fixed $Y_i$ is not 0 or $k - 1$ is small, namely

$$\sum_{j=1}^{k-2} \Pr[Y_i = j] = \sum_{j=1}^{k-2} \delta a_{ij} \le (k-2)m\delta.$$

Hence,

$$\Pr[A_{\ge 2}] \le \binom{n}{2}\big((k-2)m\delta\big)^2 \le \tfrac{1}{2} \cdot \big(n(k-2)m\delta\big)^2.$$

The contribution of $A_{\ge 2}$ to $r_{P_a,ij}$ is $r_{P_a,ij,A_{\ge 2}} := \sum_{l=0}^{n} p_i^{n-l}(p_i - 1)^l \Pr_{Y \sim P_a}[Y = l(k-1) + j \cap A_{\ge 2}]$, and therefore

$$|r_{P_a,ij,A_{\ge 2}}| \le \tfrac{1}{2}\big(n(k-2)m\delta\big)^2,$$

since $|p_i^{n-l}(p_i - 1)^l| \le 1$.

Secondly, let $A_0$ be the event that all $Y_i$'s are 0 or $k - 1$. If $A_0$ occurs then $Y$ is a multiple of $k - 1$. Thus, for $l \in [n]$ and $j \in \{1, \ldots, k-2\}$, we have $\Pr_{Y \sim P_a}[Y = l(k-1) + j \cap A_0] = 0$. The contribution of $A_0$ to $r_{P_a,ij}$ is

$$r_{P_a,ij,A_0} := \sum_{l=0}^{n} p_i^{n-l}(p_i - 1)^l \Pr_{Y \sim P_a}[Y = l(k-1) + j \cap A_0] = 0.$$

Finally, for $i \in \{1, \ldots, n\}$, let $B_i$ be the event that $Y_i$ is the only $k$-IRV that takes a value between 1 and $k - 2$. The probability of all other $Y_h$, with $h \ne i$, being either 0 or $k - 1$ is $\prod_{h \ne i} \gamma_{a,h}$. We consider the RVs $X_{-i} = \sum_{h \ne i} X_h$, where $X_h \sim \mathrm{Ber}(p_h)$. That is, $X_{-i} \sim P_{-i} \in \mathcal{S}_{n-1,2}$, i.e., it is a 2-SIIRV with parameters $p_h$ for $h \ne i$. Then, the conditional probability $\Pr\big[\sum_{h \ne i} Y_h = l(k-1) \mid (B_i \cup A_0)\big]$ is equal to $\Pr[X_{-i} = l] = P_{-i}(l)$ for all $l \in [n]$. So, for all $l \in [n]$ and $j \in \{1, \ldots, k-2\}$ we have

$$\Pr[Y = l(k-1) + j \cap B_i] = \Pr\Big[\sum_{h \ne i} Y_h = l(k-1) \cap (B_i \cup A_0)\Big] \Pr[Y_i = j] = \Big(\prod_{h \ne i} \gamma_{a,h}\Big) P_{-i}(l)\, \delta a_{ij}.$$

Then, the contribution of $B_i$ to $r_{P_a,gj}$ is

$$r_{P_a,gj,B_i} := \sum_{l=0}^{n} p_g^{n-l}(p_g - 1)^l \Pr_{Y \sim P_a}[Y = l(k-1) + j \cap B_i]$$

$$= \Big(\prod_{h \ne i} \gamma_{a,h}\Big) \cdot \delta a_{ij} \cdot \sum_{l=0}^{n} p_g^{n-l}(p_g - 1)^l P_{-i}(l)$$

$$= \Big(\prod_{h \ne i} \gamma_{a,h}\Big) \cdot \delta a_{ij} \cdot r_{P_{-i}}(p_g)$$

$$= \Big(\prod_{h \ne i} \gamma_{a,h}\Big) \cdot \delta a_{ij} \prod_{h \ne i}(p_h - p_g),$$

where $r_{P_{-i}}$ above is as defined in the previous section, and when $g \ne i$, the second product includes the term $p_g - p_g = 0$, so $r_{P_a,gj,B_i} = 0$. Summing these contributions to the expectation $r_{P_a,ij}$ gives:

$$r_{P_a,ij} = r_{P_a,ij,A_{\ge 2}} + r_{P_a,ij,A_0} + \sum_{g=1}^{n} r_{P_a,ij,B_g}$$

$$= r_{P_a,ij,A_{\ge 2}} + r_{P_a,ij,B_i}$$

$$= r_{P_a,ij,A_{\ge 2}} + \prod_{h \ne i} \gamma_{a,h} \cdot \delta a_{ij} \cdot \prod_{h \ne i}(p_h - p_i).$$

Now consider $a$ and $b$ which for some $i \in \{1, 2, \ldots, n\}$ and $j \in \{1, 2, \ldots, k-2\}$ have $a_{ij} \ne b_{ij}$. We have that $\prod_{h \ne i} |p_h - p_i| \ge e^{-3n}$ by Equation (8), that

$$\prod_{h \ne i} \gamma_{a,h} = \prod_{h \ne i}\Big(1 - \delta \sum_{j} a_{hj}\Big) \ge \big(1 - (k-2)m\delta\big)^{n-1} \ge \big(1 - (n-1)(k-2)m\delta\big),$$

that $|a_{ij} - b_{ij}| \ge 1$, and that $|r_{P_a,ij,A_{\ge 2}}| \le \tfrac{1}{2}\big(n(k-2)m\delta\big)^2$.

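We obtain the following sequence of inequalities:

$$|r_{P_a,ij} - r_{P_b,ij}| = \big| r_{P_a,ij,B_i} - r_{P_b,ij,B_i} + r_{P_a,ij,A_{\ge 2}} - r_{P_b,ij,A_{\ge 2}} \big|$$

$$\ge \Big| \prod_{h \ne i}(p_h - p_i) \Big( \prod_{h \ne i} \gamma_{a,h}\, \delta a_{ij} - \prod_{h \ne i} \gamma_{b,h}\, \delta b_{ij} \Big) \Big| - \big(n(k-2)m\delta\big)^2$$

$$\ge e^{-3n} \Big| \prod_{h \ne i} \gamma_{a,h}\, \delta \Big| \cdot |a_{ij} - b_{ij}| - e^{-3n} \delta b_{ij} \Big| \prod_{h \ne i} \gamma_{a,h} - \prod_{h \ne i} \gamma_{b,h} \Big| - \big(n(k-2)m\delta\big)^2$$

$$\ge e^{-3n}\big(1 - (n-1)(k-2)m\delta\big)\delta - \delta m \Big( \Big|1 - \prod_{h \ne i} \gamma_{a,h}\Big| + \Big|1 - \prod_{h \ne i} \gamma_{b,h}\Big| \Big) - \big(n(k-2)m\delta\big)^2$$

$$\ge e^{-3n}\delta - 2\delta m n(k-2)m\delta - 2\big(n(k-2)m\delta\big)^2$$

$$\ge e^{-3n}\delta - 3\big(n(k-2)m\delta\big)^2.$$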
Recall that by assumption $\epsilon \le e^{-12}(2k)^{-9}$. We set $n = \lfloor \frac{1}{12}\log(1/\epsilon) \rfloor$, $\delta = 3\epsilon^{3/4}$, and m =
\- 2 n i {k- 2 )' 1 J• Then, e~ 3n 5 > 3e and 3 (n(k — 2)m5) 2 < e. So, we have that |rp a ^ — rp bi jj| > 2e as 
required. Also, 7a > 1 — > 0, so the £>IRVs are indeed well-defined. 

Therefore, we have exhibited a set of $m^{n(k-2)}$ $k$-SIIRVs that have pairwise total variation distance at least $\epsilon$. The proof follows by observing that $m^{n(k-2)} = (1/\epsilon)^{\Omega(k\log(1/\epsilon))}$. □

5 Sample Complexity Lower Bound

In this section, we prove our sample complexity lower bounds. We start with the case $k = 2$, and then generalize our construction for an arbitrary value of $k$. As mentioned in the introduction, our sample lower bounds make crucial use of a geometric characterization of the space of $k$-SIIRVs. In Section 5.1, we describe our geometric characterization for 2-SIIRVs, and in Section 5.2 we use it to prove our 2-SIIRV sample lower bound. Similarly, in Section 5.3, we describe our geometric characterization for $k$-SIIRVs, and in Section 5.4 we use it to prove our $k$-SIIRV sample lower bound.

5.1 A Useful Structural Result for 2-SIIRVs In this subsection, we prove a novel structural result for the space of 2-SIIRVs (Lemma 5.1). This allows us to obtain a simple non-constructive lower bound on the cover size of 2-SIIRVs under the Kolmogorov distance metric. More importantly, this lemma is crucial for our tight sample complexity lower bound of the following subsection.

Before we state our lemma, we provide some basic intuition. The set of all distributions supported on $[n]$ is $n$-dimensional (viewed as a metric space). Note that each $P \in \mathcal{S}_{n,2}$ is defined by $n$ parameters. It turns out that $\mathcal{S}_{n,2}$ is also $n$-dimensional in a precise sense. This intuition is formalized in the following lemma:

Lemma 5.1. (i) Given any $P \in \mathcal{S}_{n,2}$ with distinct parameters in $(0, 1)$, there is a radius $\delta = \delta(P)$ such that any distribution $Q$ with support $[n]$ that satisfies $d_K(P, Q) \le \delta$ can also be expressed as a 2-SIIRV, i.e., $Q \in \mathcal{S}_{n,2}$.

(ii) Let $P_0 \in \mathcal{S}_{n,2}$ be the 2-SIIRV with parameters $p_i = \frac{i}{n+1}$, $1 \le i \le n$. Then any distribution $Q$ with support $[n]$ that satisfies $d_K(P_0, Q) < 2^{-9n}$ is itself a 2-SIIRV with parameters $q_i$ such that $|q_i - p_i| \le \frac{1}{4(n+1)}$.

Proof. We consider the space of cumulative distribution functions (CDFs) of all distributions of support $[n]$. Let $T_n$ be the set of sequences $0 \le x_1 \le x_2 \le \ldots \le x_n \le 1$. Consider the map $\mathcal{P}_n : T_n \to T_n$ defined as follows: For $p = (p_1, \ldots, p_n) \in T_n$ (i.e., with ordered parameters $0 \le p_1 \le \cdots \le p_n \le 1$), let $P$ be the corresponding 2-SIIRV in $\mathcal{S}_{n,2}$. For $i \in \{1, \ldots, n\}$, let $(\mathcal{P}_n(p))_i = P(< i)$. Namely, $\mathcal{P}_n$ maps a sequence of probabilities to the sequence of probabilities defining the CDF of the corresponding 2-SIIRV.

The basic idea of the proof is that the mapping $\mathcal{P}_n$ is invertible in a neighborhood of a point $p$ with distinct coordinates. This allows us to uniquely obtain the distinct parameters of a 2-SIIRV $P \in \mathcal{S}_{n,2}$ from its CDF. We will make essential use of the inverse function theorem for $\mathcal{P}_n$, which we now recall:

Theorem 5.2 (Inverse function theorem [Rud76]). Let $F : S \to S$, $S \subseteq \mathbb{R}^n$, be a continuously differentiable function and $x$ be a point in the interior of $S$ such that the Jacobian matrix of $F$, $\mathrm{Jac}(F)(x)$, is non-singular. Then there exists an inverse function, $F^{-1}$, of $F$ in a neighborhood of $F(x)$. Furthermore, the inverse function $F^{-1}$ is continuously differentiable and its Jacobian matrix satisfies $\mathrm{Jac}(F^{-1})(F(x)) = (\mathrm{Jac}(F)(x))^{-1}$.

We will apply the inverse function theorem for $\mathcal{P}_n$ at the point $p$ defining the distinct parameters of the 2-SIIRV $P$ in the statement of the lemma. It is easy to see that $\mathcal{P}_n$ is continuously differentiable. The main part of the argument involves proving that the Jacobian matrix of $\mathcal{P}_n$ at $p$, $\mathrm{Jac}(\mathcal{P}_n)(p)$, is non-singular.

Recall that $\mathrm{Jac}(\mathcal{P}_n)(p)$ is the $n \times n$ matrix whose $(i, j)$ entry is the partial derivative of $(\mathcal{P}_n)_i$ in direction $j$, i.e., $(\mathrm{Jac}(\mathcal{P}_n)(p))_{ij} = \frac{\partial (\mathcal{P}_n(p))_i}{\partial p_j}$. We start by showing the following lemma:

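Lemma 5.3. For a 2-SIIRV $P \in \mathcal{S}_{n,2}$ with parameters $p$, we have

$$M(p) \cdot \mathrm{Jac}(\mathcal{P}_n)(p) = -\mathrm{diag}\Big(\Big(\prod_{j \ne i}(p_i - p_j)\Big)_{i=1}^{n}\Big), \tag{9}$$

where $M(p)$ is the $n \times n$ matrix with entries $(M(p))_{ij} = (p_i - 1)^{j-1} p_i^{n-j}$, $1 \le i, j \le n$. Here, for $x \in \mathbb{R}^n$, we denote by $\mathrm{diag}(x)$ the diagonal matrix with entries $(\mathrm{diag}(x))_{ii} = x_i$.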

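Proof. To calculate the partial derivative $\frac{\partial (\mathcal{P}_n(p))_i}{\partial p_j}$, we isolate the effect of the parameter $p_j$ from the other variables. In particular, for $X \sim P$, i.e., $X = \sum_{i=1}^{n} X_i$ with $X_i \sim \mathrm{Ber}(p_i)$, we can write $X = X_{-j} + X_j$, where $X_{-j} = \sum_{i \ne j} X_i$. Note that $X_{-j} \sim P_{-j} \in \mathcal{S}_{n-1,2}$, i.e., it is the $(n-1)$ parameter 2-SIIRV with parameters $p_i$ for $i \ne j$. Now, for $1 \le i \le n$, we can write

$$(\mathcal{P}_n(p))_i = P(< i) = P_{-j}(< (i-1)) + (1 - p_j) P_{-j}(i - 1).$$

The derivative of this quantity with respect to $p_j$ equals $\frac{\partial (\mathcal{P}_n(p))_i}{\partial p_j} = -P_{-j}(i - 1)$. Therefore, the $j$-th column of $\mathrm{Jac}(\mathcal{P}_n)(p)$ equals $-1$ times the pdf of the distribution $P_{-j}$. This allows us to consider multiplying on the right by $\mathrm{Jac}(\mathcal{P}_n)(p)$ as taking the expectations of certain distributions. In particular, for $y \in \mathbb{R}^n$ and any $1 \le j \le n$, we have that

$$\big(y^T \mathrm{Jac}(\mathcal{P}_n)(p)\big)_j = -\sum_{i=1}^{n} y_i P_{-j}(i - 1) = -\mathbb{E}\big[y_{X_{-j}+1}\big].$$

Therefore, for $1 \le i, j \le n$, we can write

$$\big(M(p) \cdot \mathrm{Jac}(\mathcal{P}_n)(p)\big)_{ij} = -\sum_{k=1}^{n} (p_i - 1)^{k-1} p_i^{n-k} P_{-j}(k - 1) = -\mathbb{E}\big[(p_i - 1)^{X_{-j}} p_i^{n-1-X_{-j}}\big]$$

$$= -\mathbb{E}\Big[\prod_{k \ne j} (p_i - 1)^{X_k} p_i^{1 - X_k}\Big] = -\prod_{k \ne j} \mathbb{E}\big[(p_i - 1)^{X_k} p_i^{1 - X_k}\big]$$

$$= -\prod_{k \ne j} \big[(p_i - 1)p_k + p_i(1 - p_k)\big] = -\prod_{k \ne j}(p_i - p_k).$$

Note that for $i \ne j$, the above product contains the term $(p_i - p_i)$ and so is equal to 0. When $i = j$, we have $(M(p) \cdot \mathrm{Jac}(\mathcal{P}_n)(p))_{ii} = -\prod_{k \ne i}(p_i - p_k)$, completing the proof of the lemma. □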
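Identity (9) can be verified exactly for a small instance. The sketch below is an illustration with hypothetical helper names and an arbitrary parameter vector; it builds the Jacobian from the column description in the proof (column $j$ is $-1$ times the pdf of $P_{-j}$), cross-checks that description with an exact finite difference (each CDF entry is affine in $p_j$), and then multiplies by $M(p)$.

```python
from fractions import Fraction

def pbd_pmf(params):
    """pmf of a sum of independent Bernoulli(p_i) variables."""
    pmf = [Fraction(1)]
    for p in params:
        nxt = [Fraction(0)] * (len(pmf) + 1)
        for i, mass in enumerate(pmf):
            nxt[i] += mass * (1 - p)
            nxt[i + 1] += mass * p
        pmf = nxt
    return pmf

def cdf_map(params):
    """The map p -> (P(< i)) for i = 1..n."""
    pmf = pbd_pmf(params)
    return [sum(pmf[:i], Fraction(0)) for i in range(1, len(params) + 1)]

def jacobian(params):
    """Entry (i, j) is -P_{-j}(i - 1), with rows and columns indexed 1..n."""
    n = len(params)
    cols = [[-mass for mass in pbd_pmf(params[:j] + params[j + 1:])] for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]

def m_matrix(params):
    """(M(p))_{ij} = (p_i - 1)^(j-1) p_i^(n-j)."""
    n = len(params)
    return [[(p - 1) ** (j - 1) * p ** (n - j) for j in range(1, n + 1)] for p in params]

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def expected_rhs(params):
    """-diag(prod_{j != i} (p_i - p_j))."""
    n = len(params)
    out = [[Fraction(0)] * n for _ in range(n)]
    for i, p_i in enumerate(params):
        prod = Fraction(1)
        for j, p_j in enumerate(params):
            if j != i:
                prod *= (p_i - p_j)
        out[i][i] = -prod
    return out

params = [Fraction(1, 5), Fraction(2, 5), Fraction(3, 5), Fraction(4, 5)]
J = jacobian(params)
lhs = matmul(m_matrix(params), J)
rhs = expected_rhs(params)
print([lhs[i][i] for i in range(len(params))])
```

With distinct parameters every diagonal entry is non-zero, so both factors on the left-hand side are invertible, which is the non-singularity used in the proof of Lemma 5.1(i).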

We are now ready to prove part (i) of Lemma 5.1. To this end, consider a 2-SIIRV $P$ with distinct parameters $p$, i.e., $p_i \ne p_j$ for $i \ne j$, such that $p_i \in (0, 1)$ for all $i$. Note that $p$ lies in the interior of $T_n$. Moreover, for all $i$, we have $\prod_{j \ne i}(p_i - p_j) \ne 0$ and therefore the matrix $\mathrm{diag}\big(\big(\prod_{j \ne i}(p_i - p_j)\big)_{i=1}^{n}\big)$ appearing in (9) is non-singular. It follows from Lemma 5.3 that both matrices on the LHS of (9) are non-singular. In particular, $\mathrm{Jac}(\mathcal{P}_n)(p)$ is non-singular, hence we can apply the inverse function theorem. As a corollary, there exists an inverse mapping $\mathcal{P}_n^{-1}$ in some neighborhood of $\mathcal{P}_n(p)$. Specifically, there is some $\delta > 0$ such that $\mathcal{P}_n^{-1}$ is defined at every $x \in T_n$ with $\|x - \mathcal{P}_n(p)\|_\infty \le \delta$.

Let $Q$ be a distribution over $[n]$ satisfying $d_K(P, Q) \le \delta$. Equivalently, if $y = (Q(< i))_{i=1}^{n} \in T_n$ is the CDF of $Q$, then $\|\mathcal{P}_n(p) - y\|_\infty \le \delta$. Thus $\mathcal{P}_n^{-1}$ is defined at $y$ and $q = \mathcal{P}_n^{-1}(y) \in T_n$ are the parameters of a 2-SIIRV with distribution $Q$. Thus, $Q$ is a 2-SIIRV with parameters $q$, which completes the proof of (i). Note that the proof also implies that $Q$ in this neighborhood can be taken to be $\mathcal{P}_n(q')$ for $q'$ in some small neighborhood of $p$.

To prove part (ii) of Lemma 5.1, we use a geometric argument. Recall that the parameters of $P_0$ are $p_0 = \big(\frac{1}{n+1}, \ldots, \frac{n}{n+1}\big)$. Let $S \subseteq T_n$ be the set of vectors $p$ with $\|p - p_0\|_\infty \le \frac{1}{4(n+1)}$. By Lemma 4.2, we have that any $Q$ in $\mathcal{P}_n(\partial S)$ satisfies $d_{TV}(P_0, Q) \ge \frac{1}{4(n+1)} e^{-3n}$, and therefore $d_K(P_0, Q) \ge \frac{e^{-3n}}{8(n+1)^2} \ge 2^{-9n}$.

Let $B$ be the set of distributions $Q$ on $[n]$ so that $d_K(P_0, Q) < 2^{-9n}$. We claim that $\mathcal{P}_n(S) \cap B = B$. To begin, note that $S$ is compact, and therefore this intersection is closed in $B$. On the other hand, since $\mathcal{P}_n(\partial S)$ is disjoint from $B$, this intersection is $\mathcal{P}_n(\mathrm{int}(S)) \cap B$. Moreover, since $\mathcal{P}_n$ has non-singular Jacobian on $\mathrm{int}(S)$, the open mapping theorem implies that $\mathcal{P}_n(\mathrm{int}(S)) \cap B$ is an open subset of $B$. Therefore, $\mathcal{P}_n(S) \cap B$ is both a closed and open subset of $B$, and therefore, since $B$ is connected, it must be all of $B$. This completes the proof of part (ii). □

As a simple application of our structural lemma, we obtain a non-constructive lower bound on 
the cover size under the Kolmogorov distance metric: 

Theorem 5.4. For any $\epsilon > 0$ and $n = \Omega(\log(1/\epsilon))$, any $\epsilon$-cover of $\mathcal{S}_{n,2}$ under $d_{\mathrm K}$ must have size at
least $n \cdot (1/\epsilon)^{\Omega(\log(1/\epsilon))}$.

Proof. Note that by an argument identical to that of Corollary 4.4, it suffices to prove a packing
lower bound of $(1/\epsilon)^{\Omega(\log(1/\epsilon))}$ for $n = \Theta(\log(1/\epsilon))$.

To that end, fix $n = n_0 = \lfloor \frac{1}{18} \log_2(1/\epsilon) \rfloor$. Then, we have $2^{-9n} \ge \sqrt{\epsilon}$. By Lemma 5.1(ii), there
is a 2-SIIRV $P_0 \in \mathcal{S}_{n,2}$ such that any distribution $Q$ with support $[n]$ and $d_{\mathrm K}(P_0, Q) < \sqrt{\epsilon}$ is in
$\mathcal{S}_{n,2}$. We will give an $\epsilon$-packing lower bound for this subset of 2-SIIRVs.

Let us denote by $z \in T_n$ the vector defining the CDF of $P_0$, i.e., $z = (P_0(\le i))_{i=1}^{n}$. Let $S \subseteq \mathbb{R}^n$
be the set of points $x \in \mathbb{R}^n$ with $\|x - z\|_\infty < \sqrt{\epsilon}$. Note that $S$ is an $n$-cube with side length $2\sqrt{\epsilon}$.

We claim that every $x \in S$ is the CDF of a 2-SIIRV $Q \in \mathcal{S}_{n,2}$. By Lemma 5.1, this follows
immediately if $x \in T_n$, i.e., if $x$ is the CDF of a distribution. So, it suffices to show that $S \subseteq T_n$.
Suppose for the sake of contradiction that there is a point $y \in S \setminus T_n$. Then, since $z \in S \cap T_n$ and $S$ is
convex, there is a point $x \in S$ such that $x$ lies on the boundary of $T_n$. For such a point $x$, one of the
inequalities $0 \le x_1 \le x_2 \le \cdots \le x_n \le 1$ is tight. Thus, $x$ is the CDF of a distribution $Q$ which has
$Q(i) = 0$ for some $i$. Since $x \in S \cap T_n$, $Q$ is a 2-SIIRV with parameters given by Lemma 5.1. In particular,
$Q$ does not have any parameters equal to 0 or 1. Thus, we have $Q(i) > 0$ for all $i \in [n]$, a contradiction.

Therefore, any $\epsilon$-cover of $\mathcal{S}_{n,2}$ in Kolmogorov distance induces an $\epsilon$-cover of the same size in
$L_\infty$ distance of the CDFs of distributions in $\mathcal{S}_{n,2}$. If $s$ is the size of such a cover, then we have $s$
$n$-cubes of side length $2\epsilon$ whose union contains $S$. Recall that $S$ is an $n$-cube of side length $2\sqrt{\epsilon}$. The
volume of each of these $s$ $n$-cubes is $(2\epsilon)^n$ and the volume of $S$ is $(2\sqrt{\epsilon})^n$. The volume of the union
of the $s$ $n$-cubes is at most $s \cdot (2\epsilon)^n$, and hence $s \cdot (2\epsilon)^n \ge (2\sqrt{\epsilon})^n$, or $s \ge (1/\epsilon)^{n/2} = (1/\epsilon)^{\Omega(n)}$, which
completes the proof. □
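The volume comparison in the proof above can be evaluated numerically. The following is a minimal sketch, not part of the original argument: the helper name `packing_lower_bound_log2` is hypothetical, and the choice $n_0 = \lfloor \log_2(1/\epsilon)/18 \rfloor$ is an assumption (it is the largest $n$ with $2^{-9n} \ge \sqrt{\epsilon}$).

```python
import math
from typing import Tuple


def packing_lower_bound_log2(eps: float) -> Tuple[int, float]:
    """Return (n0, log2 of the volume-ratio lower bound on the cover size s).

    Assumption: n0 = floor(log2(1/eps) / 18), so that 2^(-9 n0) >= sqrt(eps).
    The bound s * (2 eps)^n >= (2 sqrt(eps))^n rearranges to s >= (1/eps)^(n/2).
    """
    n0 = math.floor(math.log2(1.0 / eps) / 18.0)
    log2_s = 0.5 * n0 * math.log2(1.0 / eps)
    return n0, log2_s


if __name__ == "__main__":
    n0, log2_s = packing_lower_bound_log2(2.0 ** -72)
    print(n0, log2_s)
```

Working with the base-2 logarithm of the bound avoids overflow, since the bound itself grows like $(1/\epsilon)^{\Omega(\log(1/\epsilon))}$.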

5.2 Sample complexity lower bound for 2-SIIRVs

In this subsection, we prove our tight
sample lower bound for learning 2-SIIRVs. Our proof uses a combination of information-theoretic
arguments and the structural lemma of the previous subsection. In particular, we show:

Theorem 5.5 (Sample Lower Bound for 2-SIIRVs). Let $A$ be any algorithm which, given as input
$n$, $\epsilon$, and sample access to an unknown $P \in \mathcal{S}_{n,2}$, outputs a hypothesis distribution $H$ such that
$\mathbb{E}[d_{\mathrm{TV}}(H, P)] \le \epsilon$. Then, $A$ must use $\Omega((1/\epsilon^2) \cdot \sqrt{\log(1/\epsilon)})$ samples.

Our main information-theoretic tool to prove our lower bound is Assouad's Lemma [Ass83].
We recall the statement of the lemma (see, e.g., [DG85]), tailored to discrete distributions, below:

Theorem 5.6 ([DG85, Theorem 5, Chapter 4]). Let $r \ge 1$ be an integer. For each $b \in \{-1, 1\}^r$,
let $P_b$ be a probability distribution over a finite set $A$. For $1 \le \ell \le r$ and $b \in \{-1, 1\}^r$, we denote
by $b^{(\ell,+)}$ (resp. $b^{(\ell,-)}$) the vector with $b^{(\ell,+)}_i = b_i$ (resp. $b^{(\ell,-)}_i = b_i$) for $i \neq \ell$ and $b^{(\ell,+)}_\ell = 1$ (resp.
$b^{(\ell,-)}_\ell = -1$). Suppose there exists a partition $A_0, A_1, \ldots, A_r$ of $A$ such that for all $b \in \{-1, 1\}^r$
and all $1 \le \ell \le r$, the following inequalities are valid:

(a) $\sum_{x \in A_\ell} |P_{b^{(\ell,+)}}(x) - P_{b^{(\ell,-)}}(x)| \ge \alpha$, and

(b) $\sum_{x \in A} \sqrt{P_{b^{(\ell,+)}}(x)\, P_{b^{(\ell,-)}}(x)} \ge 1 - \gamma > 0$.

Then, for any algorithm $A$ that draws $s$ samples from an unknown $P \in \{P_b\}$ and outputs a
hypothesis distribution $H$, there is some $b \in \{-1, 1\}^r$ such that, if the target distribution $P$ is $P_b$,

$$\mathbb{E}[d_{\mathrm{TV}}(P, H)] \ge (r\alpha/4)(1 - \sqrt{2 s \gamma}).$$
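To see how the parameters in Theorem 5.6 trade off, the bound can be evaluated directly. This is an illustrative sketch only: the function name `assouad_lower_bound` and the numeric values are assumptions chosen for the example, not quantities from the paper.

```python
import math


def assouad_lower_bound(r: int, alpha: float, gamma: float, s: float) -> float:
    """Evaluate (r * alpha / 4) * (1 - sqrt(2 * s * gamma)), clipped at zero.

    r     : dimension of the hypercube {-1, 1}^r
    alpha : per-coordinate L1 separation from condition (a)
    gamma : affinity deficit from condition (b)
    s     : number of samples drawn by the learner
    """
    return max(0.0, (r * alpha / 4.0) * (1.0 - math.sqrt(2.0 * s * gamma)))


if __name__ == "__main__":
    r, alpha, gamma = 10, 0.02, 1e-4
    s = 1.0 / (8.0 * gamma)  # with this choice sqrt(2 s gamma) = 1/2
    print(assouad_lower_bound(r, alpha, gamma, s))
```

Setting $s = 1/(8\gamma)$ makes $\sqrt{2 s \gamma} = 1/2$, so the bound becomes $r\alpha/8$; this is the choice of $s$ used in the proofs of the sample lower bounds in this section.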

Recall that 2-SIIRVs are discrete log-concave distributions. We will use the following basic 
properties of log-concave distributions: 

Lemma 5.7. There exists a universal constant $c > 0$ such that the following holds: For any
log-concave distribution $P$ supported on the integers with standard deviation $\sigma$, there exist at least $\Omega(\sigma)$
consecutive integers with probability mass under P at least c ■ . 

Proof. Note that if $\sigma < 1$, taking the mode trivially satisfies this property.

Without loss of generality we can assume that 0 is the mode of $P$. We know that $\sum_{x \in \mathbb{Z}} x^2 P(x) = \Theta(\sigma^2)$.
Let $\sigma_+^2 = \sum_{x > 0} x^2 P(x)$. Let $t_+$ be the largest integer so that $P(t_+ + 1)/P(t_+) \ge e^{-1/t_+}$. We
note that

$$\sum_{x > 0} x^2 P(x) \le \sum_{x=0}^{\infty} x^2 P(t_+)\, e^{-(x - t_+)/t_+} = O(t_+^3 P(0)),$$

and

$$\sum_{x > 0} x^2 P(x) \ge P(t_+) \sum_{x=0}^{t_+} x^2 = \Theta(t_+^3 P(0)).$$

Also note that

$$\sum_{x > 0} P(x) \le \sum_{x=0}^{\infty} P(t_+)\, e^{-(x - t_+)/t_+} = O(t_+ P(0)).$$

Similarly, defining $\sigma_-$ and $t_-$, we find that $\sigma^2 = \Theta(\sigma_+^2 + \sigma_-^2) = \Theta(P(0)(t_+^3 + t_-^3))$. Thus, $\max(t_+, t_-)^3 P(0) = \Theta(\sigma^2)$
and $\max(t_+, t_-) P(0) = \Omega(1)$. Without loss of generality this maximum is $t_+$. Note that
for all $0 \le x \le t_+$ we have $P(x) = \Theta(P(t_+))$. This implies that $t_+ P(0) = O(1)$, and thus, by the
above, $t_+ P(0) = \Theta(1)$. Therefore, it follows by the variance bounds that $t_+^2 = \Theta(\sigma^2)$, so $t_+ = \Theta(\sigma)$. Hence,
$x = 0, 1, \ldots, t_+$ are $\Omega(\sigma)$ terms on which the value of $P$ is $\Omega(1/t_+) = \Omega(1/\sigma)$. This completes the
proof. □ 
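The conclusion of the proof (a run of $\Omega(\sigma)$ consecutive integers each carrying mass $\Omega(1/\sigma)$) can be checked on a concrete log-concave example. The sketch below uses a Binomial distribution; the threshold constant `0.2` and the helper names are assumptions made purely for illustration.

```python
import math
from typing import List


def binomial_pmf(n: int, p: float) -> List[float]:
    """Probability mass function of Binomial(n, p), a log-concave distribution."""
    return [math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(n + 1)]


def longest_heavy_run(pmf: List[float], threshold: float) -> int:
    """Length of the longest run of consecutive integers with mass >= threshold."""
    best = cur = 0
    for mass in pmf:
        cur = cur + 1 if mass >= threshold else 0
        best = max(best, cur)
    return best


if __name__ == "__main__":
    n, p = 400, 0.5
    sigma = math.sqrt(n * p * (1.0 - p))  # standard deviation, here 10
    run = longest_heavy_run(binomial_pmf(n, p), 0.2 / sigma)
    print(sigma, run)
```

For this example the heavy run is roughly twice the standard deviation, in line with the $\Omega(\sigma)$ guarantee.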
We are now ready to prove Theorem 5.5.

Proof of Theorem 5.5. Ideally, we would like to use the set of 2-SIIRVs whose parameters are
explicitly described in Theorem 4.1 in our application of Assouad's lemma. Unfortunately, however,
this particular set is not in a form that allows a direct application of the theorem. The difficulty lies
in the fact that it is not clear how to isolate the changes between distributions in disjoint intervals
using explicit parameters.

We therefore proceed with an indirect approach making essential use of Lemma 5.1(ii). We
start from the 2-SIIRV $P_0$ in the statement of the lemma and we perturb its pdf appropriately
to construct our “hypercube” distributions $P_b$. The lemma guarantees that, if the perturbation is
small enough, all these distributions are indeed 2-SIIRVs.

Observe that the variance of $P_0$ is $\Omega(n)$ since $\Omega(n)$ parameters $p_i$ lie in $[1/4, 3/4]$. By Lemma
5.7, there exist $r = \Omega(\sqrt{n})$ consecutive integers, an integer $m$, $0 \le m \le n$, and a real value $t$ with
$t \ge c \cdot r$, such that for all $i$ with $m \le i \le m + 2r$, we have

$$P_0(i) \ge \frac{2}{t}.$$

For $n$ sufficiently large, we can assume that $2^{-9n} < c$ and therefore $\frac{1}{t} \ge \frac{2^{-9n}}{r}$.

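We are now ready to define our “hypercube” of 2-SIIRVs. For $b \in \{-1, 1\}^r$, consider the
distribution $P_b$ with

$$P_b(i) = \begin{cases} P_0(i) & \text{if } i < m,\ i > m + 2r, \text{ or } b_{\lfloor (i-m)/2 \rfloor} = -1, \\ P_0(i) - \frac{2^{-9n}}{r} & \text{if } b_{\lfloor (i-m)/2 \rfloor} = 1 \text{ and } i \text{ is even}, \\ P_0(i) + \frac{2^{-9n}}{r} & \text{if } b_{\lfloor (i-m)/2 \rfloor} = 1 \text{ and } i \text{ is odd}. \end{cases}$$

Note that all these distributions are 2-SIIRVs, as follows from Lemma 5.1(ii), since

$$d_{\mathrm K}(P_b, P_0) \le d_{\mathrm{TV}}(P_b, P_0) \le 2^{-9n}.$$

For $0 \le i \le r - 1$, the sets $A_{i+1} = \{m + 2i, m + 2i + 1\}$ define the partition of the domain. We can
now apply Assouad's lemma to this instance.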

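For $b \in \{-1, 1\}^r$ we can write

$$\sum_{x \in A_\ell} |P_{b^{(\ell,+)}}(x) - P_{b^{(\ell,-)}}(x)| = \frac{2 \cdot 2^{-9n}}{r}.$$

Similarly,

$$\sum_{i=0}^{n} \Big(\sqrt{P_{b^{(\ell,+)}}(i)} - \sqrt{P_{b^{(\ell,-)}}(i)}\Big)^2 = \sum_{i = m+2\ell,\, m+2\ell+1} \left(\frac{P_{b^{(\ell,+)}}(i) - P_{b^{(\ell,-)}}(i)}{\sqrt{P_{b^{(\ell,+)}}(i)} + \sqrt{P_{b^{(\ell,-)}}(i)}}\right)^2$$

$$= \sum_{i = m+2\ell,\, m+2\ell+1} \left(\frac{2^{-9n}/r}{\sqrt{P_{b^{(\ell,+)}}(i)} + \sqrt{P_{b^{(\ell,-)}}(i)}}\right)^2 \le \sum_{i = m+2\ell,\, m+2\ell+1} \left(\frac{2^{-9n}/r}{2\sqrt{1/t}}\right)^2 \le \frac{2^{-18n} \cdot c}{2r},$$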
where the first inequality uses the fact that

$$P_b(i) \ge P_0(i) - \frac{2^{-9n}}{r} \ge \frac{2}{t} - \frac{1}{t} = \frac{1}{t}$$

for $m \le i \le m + 2r$.

Therefore, the parameters in Assouad's Lemma are

$$\alpha := \frac{2 \cdot 2^{-9n}}{r}, \qquad \gamma = \frac{2^{-18n} \cdot c}{2r}, \qquad \text{and} \qquad s = \frac{1}{8\gamma},$$

from which we obtain that there is a $P_b$ with

$$\mathbb{E}[d_{\mathrm{TV}}(H, P_b)] \ge (r\alpha/4) \cdot (1 - \sqrt{2 s \gamma}) = \frac{2^{-9n}}{4}.$$

Hence, for $\epsilon = 2^{-9n-2}$, if the number of samples satisfies

$$s \le \frac{1}{8\gamma} = \frac{r \cdot 2^{18n}}{4c} = \Omega(2^{18n}\sqrt{n}) = \Omega\big((1/\epsilon^2)\sqrt{\log(1/\epsilon)}\big),$$

then $\mathbb{E}[d_{\mathrm{TV}}(H, P_b)] \ge \epsilon$, completing the proof of the theorem. □
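The hypercube construction used in this proof can be sketched on a toy base distribution. The code below is only an illustration under stated assumptions: the base pmf, the perturbation size `delta` (standing in for $2^{-9n}$), the 0-based indexing of `b`, and the convention that mass moves from the first to the second point of each pair are choices made here, not the paper's exact parameters.

```python
from typing import List, Sequence


def hypercube_member(p0: Sequence[float], m: int, r: int, delta: float,
                     b: Sequence[int]) -> List[float]:
    """Perturb p0 on the pairs {m+2l, m+2l+1}, l = 0..r-1.

    When b[l] == +1, mass delta/r is moved from m+2l to m+2l+1;
    when b[l] == -1, the pair is left unchanged.
    """
    pb = list(p0)
    for l in range(r):
        if b[l] == 1:
            pb[m + 2 * l] -= delta / r
            pb[m + 2 * l + 1] += delta / r
    return pb


def pair_l1_gap(p0: Sequence[float], m: int, r: int, delta: float,
                b: Sequence[int], l: int) -> float:
    """L1 distance on pair l between the members with b[l] = +1 and b[l] = -1."""
    b_plus, b_minus = list(b), list(b)
    b_plus[l], b_minus[l] = 1, -1
    plus = hypercube_member(p0, m, r, delta, b_plus)
    minus = hypercube_member(p0, m, r, delta, b_minus)
    return sum(abs(plus[i] - minus[i]) for i in (m + 2 * l, m + 2 * l + 1))


if __name__ == "__main__":
    p0 = [0.05] * 20
    print(pair_l1_gap(p0, m=2, r=4, delta=0.04, b=[1, -1, 1, 1], l=1))
```

Each member stays a probability distribution within total variation distance `delta` of the base, while flipping one coordinate changes the $L_1$ mass on the corresponding pair by exactly $2\delta/r$, which is the quantity $\alpha$ in condition (a).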


5.3 A Useful Structural Result for k-SIIRVs

In this subsection, we prove the analogous
structural result to Lemma 5.1 for k-SIIRVs.

Proposition 5.8. Let $k > 2$ be a positive integer and $\epsilon < 1/\mathrm{poly}(k)$ be sufficiently small. Let $n$
be a sufficiently small multiple of $\log(1/\epsilon)$. Define $P$ to be the k-SIIRV given by $X \sim P$ such that
$X = \sum_{i=1}^{n} X_i$, where $X_i(j) = p_{i,j}$, and for $1 \le i \le n$, $1 \le j \le k - 2$ we have that

$$p_{i,j} = \frac{1}{3(k-2)n}, \qquad p_{i,0} = \frac{1}{3} + \frac{i - 1}{3n}, \qquad p_{i,k-1} = \frac{1}{3} + \frac{n - i}{3n}.$$

Then, if $Q$ is any distribution supported on $[n(k - 1)]$ with $d_{\mathrm{TV}}(P, Q) \le \epsilon$, then $Q$ is a k-SIIRV.

Proof. The basic idea of the proof will be topological. We note that the dimensionality of the
parameter space of $n$-variate k-SIIRVs is the same as the dimensionality of the space of random
variables of appropriate support size. Our result will follow from the following lemma:

Lemma 5.9. Let $q_{i,j}$ ($1 \le i \le n$, $0 \le j \le k - 1$) be a sequence of positive real numbers with
$\sum_{j=0}^{k-1} q_{i,j} = 1$ for each $i$. Let $Y$ be the k-SIIRV defined by the $q_{i,j}$. Suppose that $\max_{i,j} |p_{i,j} - q_{i,j}| = \epsilon^{2/3}$.
Then, $d_{\mathrm{TV}}(X, Y) \ge \epsilon$.

Proof. Let $I$ and $J$ be one of the pairs of integers such that we achieve $|p_{I,J} - q_{I,J}| = \epsilon^{2/3}$. $X$ has
probability generating function $X(z) = \mathbb{E}[z^X] = \prod_{i=1}^{n} X_i(z)$. We start with the following claim:

Claim 5.10. Assuming $n$ is sufficiently large, the roots $z_0, \ldots, z_{k-2}$ of $X_I(z) = 0$ satisfy

$$\left| z_\ell - e^{\pi i (1 + 2\ell)/(k-1)} \left(\frac{n + I}{2n - I}\right)^{1/(k-1)} \right| \le O\big(1/((k-1)n)\big),$$

for $0 \le \ell \le k - 2$. Also, $|z_\ell^j| \le 3$ for all $1 \le j \le k - 1$.
Proof. Specifically we claim that when $n \ge 200$, there is a root within distance $33/((k - 1)n)$.

Consider the polynomial $f_I(x) = (1/3 + (n - I)/(3n))x^{k-1} + (1/3 + I/(3n))$. Then, $f_I(x) = 0$
has roots $x = a_\ell$, where

$$a_\ell = e^{\pi i (1 + 2\ell)/(k-1)} \left(\frac{n + I}{2n - I}\right)^{1/(k-1)}$$

for $0 \le \ell \le k - 2$. Note that $X_I(x) = f_I(x) + \sum_{j=1}^{k-2} x^j/(3(k - 2)n) - 1/(3n)$. Also, for any $1 \le j \le k - 1$,
we have $\frac{1}{2} \le |a_\ell^j| \le 2$.

We will show that for any $y \in \mathbb{C}$ with $|y - a_\ell| = 33/((k - 1)n)$, it holds that $|X_I(y)| > |X_I(a_\ell)|$, and
therefore there is a root $z_\ell$ of $X_I(x)$ with $|z_\ell - a_\ell| \le 33/((k - 1)n)$. We have

$$|X_I(a_\ell)| = \Big| f_I(a_\ell) + \sum_{j=1}^{k-2} \frac{a_\ell^j}{3(k - 2)n} - \frac{1}{3n} \Big| \le \frac{2}{3n} + \frac{1}{3n} \le \frac{4}{3n}.$$

Now we consider $f_I(x)$ expressed as a polynomial in $w = x - a_\ell$. We claim that this is dominated
by the $w$ term when $|w| = 33/((k - 1)n)$. We show that, under certain conditions, the binomial series
is dominated by its first two terms:

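Claim 5.11. If $m|x/b| \le 1/3$, then $|(b + x)^m - b^m - m x b^{m-1}| \le m|x b^{m-1}|/2$.

Proof. By the binomial theorem, $(b + x)^m = \sum_{j=0}^{m} \binom{m}{j} x^j b^{m-j}$. Note that the ratio of the absolute
values of the $x^{j+1}$ and $x^j$ terms is

$$\frac{\big|\binom{m}{j+1} x^{j+1} b^{m-j-1}\big|}{\big|\binom{m}{j} x^{j} b^{m-j}\big|} = \frac{m - j}{j + 1} \cdot |x/b| \le 1/3.$$

Thus, the terms of degree at least two in $x$ are dominated by a geometric series with ratio $1/3$ starting from the linear term, and

$$|(b + x)^m - b^m - m x b^{m-1}| \le m|x b^{m-1}| \sum_{j=1}^{m-1} 3^{-j} \le m|x b^{m-1}|/2.$$

□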
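Claim 5.11 is easy to sanity-check numerically. The sketch below tests the inequality in the form $|(b+x)^m - b^m - m x b^{m-1}| \le m|x b^{m-1}|/2$ on random complex inputs satisfying $m|x/b| \le 1/3$; the function names and the sampling ranges are assumptions made for this illustration.

```python
import cmath
import math
import random


def binomial_remainder_ok(b: complex, x: complex, m: int) -> bool:
    """Check |(b+x)^m - b^m - m x b^(m-1)| <= m |x b^(m-1)| / 2 (small float slack)."""
    lhs = abs((b + x) ** m - b ** m - m * x * b ** (m - 1))
    rhs = m * abs(x * b ** (m - 1)) / 2.0
    return lhs <= rhs * (1.0 + 1e-9)


def check_random_instances(trials: int, seed: int) -> bool:
    """Sample (b, x, m) with m * |x / b| <= 1/3 and test the inequality on each."""
    rng = random.Random(seed)
    for _ in range(trials):
        m = rng.randint(1, 12)
        b = cmath.rect(rng.uniform(0.5, 2.0), rng.uniform(0.0, 2.0 * math.pi))
        radius = rng.uniform(0.1, 1.0) * abs(b) / (3.0 * m)
        x = cmath.rect(radius, rng.uniform(0.0, 2.0 * math.pi))
        if not binomial_remainder_ok(b, x, m):
            return False
    return True


if __name__ == "__main__":
    print(check_random_instances(trials=1000, seed=0))
```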


When $|w| = 33/((k - 1)n)$, we have $(k - 1)|w/a_\ell| \le 66/n < 1/3$, and therefore

$$f_I(w + a_\ell) = (1/3 + (n - I)/(3n))(w + a_\ell)^{k-1} + (1/3 + I/(3n))$$

satisfies

$$\big| f_I(w + a_\ell) - (1/3 + (n - I)/(3n))(k - 1) w a_\ell^{k-2} \big| \le (1/3 + (n - I)/(3n))(k - 1)|w a_\ell^{k-2}|/2.$$

Since $(1/3 + (n - I)/(3n))(k - 1)|w a_\ell^{k-2}|/2 \ge 33/(12n)$, we get $|f_I(w + a_\ell)| \ge 33/(12n)$.

Now we have that $|f_I(y)| \ge 33/(12n)$ and $f_I(a_\ell) = 0$. We also have

$$|X_I(y) - f_I(y)| = \Big| \sum_{j=1}^{k-2} \frac{y^j}{3(k - 2)n} - \frac{1}{3n} \Big| \le \sum_{j=1}^{k-2} \frac{(|a_\ell| + 33/((k - 1)n))^j}{3(k - 2)n} + \frac{1}{3n}.$$

By Claim 5.11 applied to $(|a_\ell| + 33/((k - 1)n))^j$, we have that

$$(|a_\ell| + 33/((k - 1)n))^j \le |a_\ell|^j + \frac{3j}{2}\,|a_\ell|^{j-1} \cdot \frac{33}{(k - 1)n} \le 2 + \frac{99 j}{(k - 1)n} \le 3.$$

So,

$$|X_I(y) - f_I(y)| \le \frac{1}{3n} + \sum_{j=1}^{k-2} \frac{(|a_\ell| + 33/((k - 1)n))^j}{3(k - 2)n} \le \frac{1}{n} + \frac{1}{3n} = \frac{4}{3n}.$$

We have

$$|X_I(y)| \ge |f_I(y)| - |X_I(y) - f_I(y)| \ge \frac{33 - 16}{12n} > \frac{4}{3n} \ge |X_I(a_\ell)|.$$

Since this holds for all $y \in \mathbb{C}$ with $|y - a_\ell| = 33/((k - 1)n)$, it follows that there is a $z_\ell \in \mathbb{C}$ with
$|z_\ell - a_\ell| \le 33/((k - 1)n)$.

Finally, note that since $|z_\ell - a_\ell| \le 33/((k - 1)n) \le 1/(6(k - 1))$, for any $1 \le j \le k - 1$, we have
$j|(z_\ell - a_\ell)/a_\ell| \le 1/3$ and so by Claim 5.11

$$|z_\ell^j - a_\ell^j - j(z_\ell - a_\ell)a_\ell^{j-1}| \le |j(z_\ell - a_\ell)a_\ell^{j-1}|/2 \le 1/6.$$

Thus, $|z_\ell^j| \le |a_\ell^j| + 1/2 \le 3$. □

Our lemma will follow easily from the following claim: 

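Claim 5.12. For some $\ell$, we have that $|X(z_\ell) - Y(z_\ell)| \ge \epsilon^{5/6}$.

Proof. Note that for each $i$, since $|z_\ell^j| \le 3$ for all $1 \le j \le k - 1$,

$$|X_i(z_\ell) - Y_i(z_\ell)| \le 3k\epsilon^{2/3} \le \epsilon^{1/2}/n.$$

Furthermore, note that for $i \neq I$ we have $|X_i(z_\ell)| = \Omega(|i - I|/n)$. This implies that

$$\prod_{i \neq I} |Y_i(z_\ell)| = 2^{-O(n)}.$$

However, we have that

$$X_I(z_\ell) = 0$$

for all $\ell$.

It suffices to show that $|Y_I(z_\ell)| \ge \epsilon^{3/4}$ for some $\ell$. Let $z_{k-1} = 1$. By standard polynomial
interpolation, we have that

$$Y_I(z) = \sum_{\ell=0}^{k-1} Y_I(z_\ell) \prod_{j \neq \ell} \frac{z - z_j}{z_\ell - z_j}.$$

Similarly,

$$X_I(z) = \sum_{\ell=0}^{k-1} X_I(z_\ell) \prod_{j \neq \ell} \frac{z - z_j}{z_\ell - z_j}.$$

In order to make use of this, we need to bound the size of the coefficients of the polynomial
$\prod_{j \neq \ell} \frac{z - z_j}{z_\ell - z_j}$.

Claim 5.13. For any $\ell$, we have that all coefficients of $\prod_{j \neq \ell} \frac{z - z_j}{z_\ell - z_j}$ are $O(1)$.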
Proof. Let

$$Q(z) = X_I(z) = \sum_{j=0}^{k-1} p_{I,j}\, z^j = \Big(\frac{1}{3} + \frac{n - I}{3n}\Big) \prod_{j=0}^{k-2} (z - z_j).$$

Firstly, for $\ell = k - 1$ the polynomial in question is $Q(z)/Q(1) = Q(z)$, which clearly has coefficients
of size $O(1)$. For $\ell < k - 1$, the polynomial in question is

$$\frac{Q(z)(z - 1)}{(z - z_\ell)\, Q'(z_\ell)\, (z_\ell - 1)}.$$

It should be noted that $Q'(z_\ell)(z_\ell - 1) = \Omega(1)$ and that multiplying a polynomial by $z - 1$
at most doubles the size of its maximum coefficient. Therefore, it suffices to consider the
polynomial $Q(z)/(z - z_\ell)$. In order to analyze this, we write $1/(z - z_\ell)$ as a power series
$P(z) := -\sum_{m=0}^{\infty} z^m/z_\ell^{m+1}$. We note that the polynomial in question is the product of $Q(z)$ times this power
series. We note that we need only consider the first $k$ terms of this product since terms of degree
more than $k$ cancel. Noting that the first $k$ coefficients of $P(z)$ are all $O(1)$ and that the coefficients
of $Q(z)$ have absolute values summing to 1 implies that the first $k$ coefficients in their product are
all $O(1)$. This completes the proof. □

Therefore, the largest coefficient of $X_I(z) - Y_I(z)$ is at most

$$O(1) \sum_{\ell} |X_I(z_\ell) - Y_I(z_\ell)|.$$

Recall that this largest coefficient is $\epsilon^{2/3}$ by assumption. Therefore, for some $\ell$ we must have that

$$|X_I(z_\ell) - Y_I(z_\ell)| \ge \Omega(\epsilon^{2/3}/k) \ge \epsilon^{3/4}.$$

On the other hand, we have that

$$X_I(z_{k-1}) = Y_I(z_{k-1}) = 1,$$

and so for some other $\ell$ we must have that $|Y_I(z_\ell)| \ge \epsilon^{3/4}$. For this $\ell$, note that

$$X(z_\ell) = 0,$$

and

$$|Y(z_\ell)| \ge 2^{-O(n)} \epsilon^{3/4} \ge \epsilon^{5/6}.$$

This proves the claim.

The lemma now follows from the fact that

$$|X(z_\ell) - Y(z_\ell)| = \Big| \sum_{m=0}^{n(k-1)} z_\ell^m \big(X(m) - Y(m)\big) \Big| \le 2^{O(n)} \sum_{m=0}^{n(k-1)} |X(m) - Y(m)| = 2^{O(n)}\, d_{\mathrm{TV}}(X, Y).$$

□

□
Note that Lemma 5.9 actually applies for any $Y$ and $Z$ with all parameters of $Z$ within $\epsilon^{2/3}$ of
those of $X$, and the parameters of $Y$ at distance $\delta$ from those of $Z$, giving that $d_{\mathrm{TV}}(Y, Z) \ge \epsilon^{1/3}\delta$. This implies
that the derivative of the map $F : \mathbb{R}^{n(k-1)} \to \mathbb{R}^{n(k-1)}$ from parameters of $(n, k)$-SIIRVs close to $X$
to probability distributions on $[n(k - 1)]$ is everywhere injective (if $DF(Z)$ had some null-vector $v$,
then $F(Z + \delta v)$ would be $F(Z) + o(\delta)$, which is a contradiction). Therefore, by the Inverse Function
Theorem, $F$ is an open map.

Let $B_1$ be the set of parameters of k-SIIRVs within $\epsilon^{2/3}$ in $L^\infty$ of those of $X$. Let $B_2$ be the
set of distributions on $[n(k - 1)]$ within $\epsilon$ of $X$. Let $V = F(B_1) \cap B_2$. On the one hand, since
$B_1$ is compact, this must be a closed subset of $B_2$. On the other hand, Lemma 5.9 implies that
$V = F(\mathrm{Int}(B_1)) \cap B_2$, which is an open subset of $B_2$, since $F$ is an open map. Therefore, $V$ is both
an open and closed subset of $B_2$. Since $B_2$ is connected, this implies that $V = B_2$. Thus, every
element of $B_2$ is in the image of $F$, and is thus a k-SIIRV, proving Proposition 5.8. □


5.4 Sample complexity lower bound for k-SIIRVs

In this subsection, we prove our general
sample lower bound against k-SIIRVs:


Theorem 5.14 (Sample Lower Bound for k-SIIRVs). Let $A$ be any algorithm which, given as
input $n$, $k > 2$, $\epsilon < 1/\mathrm{poly}(k)$, and sample access to an unknown $P \in \mathcal{S}_{n,k}$, outputs a hypothesis
distribution $H$ such that $\mathbb{E}[d_{\mathrm{TV}}(H, P)] \le \epsilon$. Then, $A$ must use $\Omega((k/\epsilon^2) \cdot \sqrt{\log(1/\epsilon)})$ samples.

In addition to the structural result of the previous subsection, we also need to prove an analogue
of Lemma 5.7, which does not immediately apply, as k-SIIRVs need not be log-concave. In fact, we
remark that Lemma E3 does not apply to the fc-SIIRVs used in the lower bound construction of 
Section m So, we need to use a slightly different construction. 


Lemma 5.15. For the k-SIIRV $P$ defined in Proposition 5.8, there exist $\Omega((k - 1)\sqrt{n})$ consecutive
integers with probability mass under $P$ at least $\Omega(1/((k - 1)\sqrt{n}))$.

Proof. We wish to reduce this claim to Lemma 5.7, which gives that for any PBD $Q$ with standard
deviation $\sigma \ge 1$, there are at least $\Omega(\sigma)$ consecutive integers each with probability mass $\Omega(1/\sigma)$.

Recall that $P$ is the k-SIIRV given by $X \sim P$ such that $X = \sum_{i=1}^{n} X_i$, where $X_i(j) = p_{i,j}$
and for $1 \le i \le n$, $1 \le j \le k - 2$, we have that $p_{i,j} = 1/(3(k - 2)n)$, $p_{i,0} = 1/3 + (i - 1)/(3n)$,
$p_{i,k-1} = 1/3 + (n - i)/(3n)$. So, we have that $\Pr[X_i = 0 \vee X_i = k - 1] = 1 - 1/(3n)$ for all $i$.

Let $A_0$ be the event that all $X_i$ are equal to 0 or $k - 1$. Then, $\Pr[A_0] = (1 - 1/(3n))^n = \Omega(1)$. Let
$Y = X/(k - 1)$ and $Y_i = X_i/(k - 1)$. Conditioned on the event $A_0$, each $Y_i$ is a Bernoulli random
variable and $Y$ is a PBD $Q$. Note that $\mathrm{Var}[Y \mid A_0] \ge n(1/3 \cdot 2/3) = 2n/9 = \Omega(n)$. So, by Lemma 5.7,
we have that there are integers $a, b$ with $b - a = \Omega(\sqrt{n})$ such that $Q(i) \ge \Omega(1/\sqrt{n})$ for each integer
$a \le i \le b$. Since the probability of $A_0$ is $\Omega(1)$, it follows that any integer $h \in [(k - 1)a, (k - 1)b]$
with $h \equiv 0 \pmod{k - 1}$ has $\Pr[X = h] \ge \Omega(1/\sqrt{n}) \ge \Omega(1/((k - 1)\sqrt{n}))$.

For a given $1 \le i \le n$, let $B_i$ be the event that only $X_i$ takes a value between 1 and $k - 2$.
Let $Y_{-i} = \sum_{j \neq i} Y_j$. Then, the conditional distribution of $Y_{-i}$ under either $A_0$ or $B_i$ is a PBD $Q_{-i}$, which is
the same in both cases. Now, $Y = Y_{-i} + Y_i$ and, conditional on $A_0$, $Y_i$ is a Bernoulli, so for any integer
$h$, either $\Pr[Y_{-i} = h \mid A_0] \ge \Pr[Y = h \mid A_0]/2$, or $\Pr[Y_{-i} = h - 1 \mid A_0] \ge \Pr[Y = h \mid A_0]/2$. In
particular, $Q(a) \ge \Omega(1/\sqrt{n})$ and $Q(b) \ge \Omega(1/\sqrt{n})$, so it follows that either $Q_{-i}(a) \ge \Omega(1/\sqrt{n})$ or
$Q_{-i}(a - 1) \ge \Omega(1/\sqrt{n})$, and either $Q_{-i}(b) \ge \Omega(1/\sqrt{n})$ or $Q_{-i}(b - 1) \ge \Omega(1/\sqrt{n})$. However, as a
PBD, $Q_{-i}$ is unimodal, and it follows that for every integer $a \le h \le b - 1$, $Q_{-i}(h) \ge \Omega(1/\sqrt{n})$. Now,
consider an integer $(k - 1)a < h < (k - 1)b$ with $h \not\equiv 0 \pmod{k - 1}$. We can write $h = q(k - 1) + r$
for integers $a \le q \le b - 1$ and $1 \le r < k - 1$. Note that $\Pr[X_i = r \mid B_i] = 1/(k - 2)$, since we are
conditioning on it not taking the values 0 or $k - 1$. Then $\Pr[X = h \mid B_i] = \Pr[Y_{-i} = q \mid B_i] \cdot \Pr[X_i = r \mid B_i]
= \Omega(1/\sqrt{n}) \cdot 1/(k - 2) = \Omega(1/((k - 1)\sqrt{n}))$.

For each $1 \le i \le n$, $\Pr[B_i] = (1 - 1/(3n))^{n-1} \cdot 1/(3n) = \Omega(1/n)$. So, consider any integer $(k - 1)a \le
h \le (k - 1)b$. If $h \not\equiv 0 \pmod{k - 1}$, then, since the events $B_i$ are disjoint,
$\Pr[X = h] \ge \sum_{i=1}^{n} \Pr[X = h \mid B_i] \Pr[B_i] = \sum_{i=1}^{n} \Omega(1/((k - 1)\sqrt{n})) \cdot \Omega(1/n) = \Omega(1/((k - 1)\sqrt{n}))$.
When $h \equiv 0 \pmod{k - 1}$, we showed earlier using $A_0$ that $\Pr[X = h] \ge \Omega(1/((k - 1)\sqrt{n}))$. This holds for
$(k - 1)(b - a) \ge (k - 1)(\Omega(\sqrt{n}) - 1) = \Omega((k - 1)\sqrt{n})$ consecutive integers. □

The proof of Theorem 5.14 using Assouad's Lemma is now almost identical to that of Theorem 5.5.

Proof of Theorem 5.14. Let $P$ be the k-SIIRV defined in Proposition 5.8. Let $C$ be a constant large enough
that Proposition 5.8 implies that all distributions $Q$ with $d_{\mathrm{TV}}(P, Q) \le 2^{-Cn}$ are k-SIIRVs.

By Lemma 5.15, there exists some $c > 0$ and $r = \Omega((k - 1)\sqrt{n})$ consecutive integers, an integer
$m$, $0 \le m \le n$, and a real value $t$ with $t \ge c \cdot r$, such that for all $i$ with $m \le i \le m + 2r$, we have

$$P(i) \ge \frac{2}{t}.$$

For $n$ sufficiently large, we can assume that $2^{-Cn} < c$ and therefore $\frac{1}{t} \ge \frac{2^{-Cn}}{r}$.

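We are now ready to define our “hypercube” of k-SIIRVs. Write $P_0 := P$. For $b \in \{-1, 1\}^r$, consider the
distribution $P_b$ with

$$P_b(i) = \begin{cases} P_0(i) & \text{if } i < m,\ i > m + 2r, \text{ or } b_{\lfloor (i-m)/2 \rfloor} = -1, \\ P_0(i) - \frac{2^{-Cn}}{r} & \text{if } b_{\lfloor (i-m)/2 \rfloor} = 1 \text{ and } i \text{ is even}, \\ P_0(i) + \frac{2^{-Cn}}{r} & \text{if } b_{\lfloor (i-m)/2 \rfloor} = 1 \text{ and } i \text{ is odd}. \end{cases}$$

Note that Proposition 5.8 yields that all these distributions are k-SIIRVs since

$$d_{\mathrm K}(P_b, P_0) \le d_{\mathrm{TV}}(P_b, P_0) \le 2^{-Cn}.$$

For $0 \le i \le r - 1$, the sets $A_{i+1} = \{m + 2i, m + 2i + 1\}$ define the partition of the domain. We can
now apply Assouad's lemma to this instance.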

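For $b \in \{-1, 1\}^r$ we can write

$$\sum_{x \in A_\ell} |P_{b^{(\ell,+)}}(x) - P_{b^{(\ell,-)}}(x)| = \frac{2 \cdot 2^{-Cn}}{r}.$$

Similarly,

$$\sum_{i=0}^{n(k-1)} \Big(\sqrt{P_{b^{(\ell,+)}}(i)} - \sqrt{P_{b^{(\ell,-)}}(i)}\Big)^2 = \sum_{i = m+2\ell,\, m+2\ell+1} \left(\frac{P_{b^{(\ell,+)}}(i) - P_{b^{(\ell,-)}}(i)}{\sqrt{P_{b^{(\ell,+)}}(i)} + \sqrt{P_{b^{(\ell,-)}}(i)}}\right)^2$$

$$= \sum_{i = m+2\ell,\, m+2\ell+1} \left(\frac{2^{-Cn}/r}{\sqrt{P_{b^{(\ell,+)}}(i)} + \sqrt{P_{b^{(\ell,-)}}(i)}}\right)^2 \le \sum_{i = m+2\ell,\, m+2\ell+1} \left(\frac{2^{-Cn}/r}{2\sqrt{1/t}}\right)^2 \le \frac{2^{-2Cn} \cdot c}{2r},$$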
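where the first inequality uses the fact that

$$P_b(i) \ge P_0(i) - \frac{2^{-Cn}}{r} \ge \frac{2}{t} - \frac{1}{t} = \frac{1}{t}$$

for $m \le i \le m + 2r$.

Therefore, the parameters in Assouad's Lemma are

$$\alpha := \frac{2 \cdot 2^{-Cn}}{r}, \qquad \gamma = \frac{2^{-2Cn} \cdot c}{2r}, \qquad \text{and} \qquad s = \frac{1}{8\gamma},$$

from which we obtain that there is a $P_b$ with

$$\mathbb{E}[d_{\mathrm{TV}}(H, P_b)] \ge (r\alpha/4) \cdot (1 - \sqrt{2 s \gamma}) = \frac{2^{-Cn}}{4}.$$

Hence, for $\epsilon = 2^{-Cn-2}$, if the number of samples satisfies

$$s \le \frac{1}{8\gamma} = \frac{r \cdot 2^{2Cn}}{4c} = \Omega\big(2^{2Cn}(k - 1)\sqrt{n}\big) = \Omega\big((k/\epsilon^2)\sqrt{\log(1/\epsilon)}\big),$$

then $\mathbb{E}[d_{\mathrm{TV}}(H, P_b)] \ge \epsilon$, completing the proof of the theorem. □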
Appendix

A Basic Facts from Probability

Definition A.1. Let $\mu \in \mathbb{R}$, $\sigma \in \mathbb{R}_{\ge 0}$. We let $Z(\mu, \sigma^2)$ denote the discretized normal distribution.
The definition of $Z \sim Z(\mu, \sigma^2)$ is that we first draw a normal $G \sim N(\mu, \sigma^2)$ and then we set
$Z = \lfloor G \rceil$; i.e., $G$ rounded to the nearest integer.

We begin by recalling some basic facts concerning total variation distance, starting with the “data 
processing inequality for total variation distance”: 

Proposition A.2 (Data Processing Inequality for Total Variation Distance). Let $X, X'$ be two
random variables over a domain $\Omega$. Fix any (possibly randomized) function $F$ on $\Omega$ (which may be
viewed as a distribution over deterministic functions on $\Omega$) and let $F(X)$ be the random variable
such that a draw from $F(X)$ is obtained by drawing independently $x$ from $X$ and $f$ from $F$ and
then outputting $f(x)$ (likewise for $F(X')$). Then we have $d_{\mathrm{TV}}(F(X), F(X')) \le d_{\mathrm{TV}}(X, X')$.

Next we recall the subadditivity of total variation distance for independent random variables: 

Proposition A.3. Let $A, A', B, B'$ be integer random variables such that $(A, A')$ is independent of
$(B, B')$. Then $d_{\mathrm{TV}}(A + B, A' + B') \le d_{\mathrm{TV}}(A, A') + d_{\mathrm{TV}}(B, B')$.

We will use the following standard result which bounds the variation distance between two 
normal distributions in terms of their means and variances: 

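Proposition A.4. Let $\mu_1, \mu_2 \in \mathbb{R}$ and $0 < \sigma_1 \le \sigma_2$. Then $d_{\mathrm{TV}}(N(\mu_1, \sigma_1^2), N(\mu_2, \sigma_2^2)) \le \frac{1}{2}\left(\frac{|\mu_1 - \mu_2|}{\sigma_1} + \frac{\sigma_2^2 - \sigma_1^2}{\sigma_1^2}\right)$.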

B Lower Bounds on Matching Moments 

We start by giving an explicit example of two PBDs over $k + 1$ variables that agree exactly on the
first $k$ moments and have total variation distance at least $4^{-k}$.

Proposition B.1. Let $P, Q \in \mathcal{S}_{k+1,2}$ be PBDs with parameters $p_i = (1 + \cos(\frac{2\pi i}{k+1}))/2$ and $q_i =
(1 + \cos(\frac{2\pi i + \pi}{k+1}))/2$ respectively, where $1 \le i \le k + 1$. Then $P$ and $Q$ agree on their first $k$ moments
and have $d_{\mathrm{TV}}(P, Q) \ge 4^{-k}$.

Proof. Let $X = \sum_{i=1}^{k+1} X_i$, where the $X_i$ are independent Bernoulli variables, and suppose that $X \sim P$.

We note that, for $m \le k$, the random variable $X^m$ can be expressed as a degree $m$ polynomial
in the $X_i$'s. Therefore, the $m$-th moment of $P$ is a degree $m$ symmetric polynomial of the $p_i$'s.
Similarly, the $m$-th moment of $Q$ must be the same symmetric polynomial of the $q_i$'s. Therefore,
to show that the first $k$ moments of $P$ and $Q$ agree, it suffices to show that the first $k$ elementary
symmetric polynomials in the $p_i$'s have the same values as the corresponding polynomials of the $q_i$'s.

Note that the $p_i$ are the roots of $T_{k+1}(2x - 1) - 1$ and that the $q_i$ are the roots of $T_{k+1}(2x - 1) + 1$,
where $T_{k+1}$ is the $(k + 1)$-st Chebyshev polynomial. Therefore, for $m \le k$, the $m$-th elementary
symmetric polynomial in the $p_i$ is $[x^{k+1-m}](-1)^m 2^{-2k-1} T_{k+1}(2x - 1)$, and the same holds for the
$q_i$. Thus, the first $k$ moments of $P$ and $Q$ agree. To bound the total variation distance from below
we observe that

$$\prod_{i=1}^{k+1} p_i = P(k + 1) = [x^0](-1)^{k+1} 2^{-2k-1}\big(T_{k+1}(2x - 1) - 1\big),$$

$$\prod_{i=1}^{k+1} q_i = Q(k + 1) = [x^0](-1)^{k+1} 2^{-2k-1}\big(T_{k+1}(2x - 1) + 1\big).$$
i= 1 
Therefore, the probability that $P = k + 1$ and the probability that $Q = k + 1$ differ by $4^{-k}$. This
implies the appropriate bound in their variational distance and completes the proof. □
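Proposition B.1 can be checked numerically for small $k$. The following sketch builds both PBDs from the Chebyshev-root parameters (taken here as $p_i = (1 + \cos(2\pi i/(k+1)))/2$ and $q_i = (1 + \cos((2i+1)\pi/(k+1)))/2$, which is how the roots of $T_{k+1}(2x-1) \mp 1$ work out) and compares moments in floating point. The helper names are hypothetical.

```python
import math
from typing import List, Tuple


def pbd_pmf(params: List[float]) -> List[float]:
    """PMF of a sum of independent Bernoulli(p) variables, by repeated convolution."""
    pmf = [1.0]
    for p in params:
        nxt = [0.0] * (len(pmf) + 1)
        for j, mass in enumerate(pmf):
            nxt[j] += mass * (1.0 - p)
            nxt[j + 1] += mass * p
        pmf = nxt
    return pmf


def moment(pmf: List[float], t: int) -> float:
    """t-th raw moment of a distribution on {0, 1, ..., len(pmf) - 1}."""
    return sum(mass * j**t for j, mass in enumerate(pmf))


def chebyshev_pair(k: int) -> Tuple[List[float], List[float]]:
    """PMFs of the two PBDs of order k + 1 from Proposition B.1."""
    p = [(1.0 + math.cos(2.0 * math.pi * i / (k + 1))) / 2.0 for i in range(1, k + 2)]
    q = [(1.0 + math.cos((2 * i + 1) * math.pi / (k + 1))) / 2.0 for i in range(1, k + 2)]
    return pbd_pmf(p), pbd_pmf(q)


if __name__ == "__main__":
    k = 3
    P, Q = chebyshev_pair(k)
    print([moment(P, t) - moment(Q, t) for t in range(1, k + 2)])
    print(abs(P[k + 1] - Q[k + 1]), 4.0 ** -k)
```

The first $k$ moment differences vanish up to rounding error, while the probabilities of the top value $k + 1$ differ by $4^{-k}$.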


We also show that matching moments does not suffice for the case of k-SIIRVs, even for $k = 3$:

Proposition B.2. For $n$ an even integer, there exist $P, Q \in \mathcal{S}_{n/2,3}$ with disjoint supports such that
their first $n - 1$ moments agree.


Proof. We first show that there exist such $P$ and $Q$ with $P$ supported on even numbers and $Q$
supported on odd numbers, so that

$$P(2j) = 2^{-n+1}\binom{n}{2j},$$

and

$$Q(2j + 1) = 2^{-n+1}\binom{n}{2j + 1}.$$

We begin by showing that $P \in \mathcal{S}_{n/2,3}$. Since $\sum_j 2^{-n+1}\binom{n}{2j} = 1$, it suffices to show that the polynomial
$P(z) = \sum_j 2^{-n+1}\binom{n}{2j} z^{2j}$ factors as a product of $n/2$ quadratic polynomials with non-negative
coefficients. To prove this, we note that it suffices to show that all roots of $P$ are pure imaginary;
then, the natural factorization into quadratics using complex conjugate pairs will complete the
argument. For this, we observe that $P(z) = 2^{-n}((1 + z)^n + (1 - z)^n)$. Therefore, $z$ is a root of $P$
only when $|1 + z| = |1 - z|$, i.e., when $z$ is equidistant from 1 and $-1$, which happens only when the
real part of $z$ is 0, i.e., when $z$ is pure imaginary.

Similarly, we show that $Q \in \mathcal{S}_{n/2,3}$. Once again $\sum_j 2^{-n+1}\binom{n}{2j+1} = 1$, and so we merely need
to show that $Q(z) = \sum_j 2^{-n+1}\binom{n}{2j+1} z^{2j+1}$ factors into quadratics with non-negative coefficients.
Since $Q(z) = 2^{-n}((1 + z)^n - (1 - z)^n)$, it also has only purely imaginary roots.

It remains to show that $P$ and $Q$ have identical first $n - 1$ moments. For this, it suffices to
show that $P^{(k)}(1) = Q^{(k)}(1)$ for all $0 \le k < n$. Indeed, we have that

$$P^{(k)}(1) - Q^{(k)}(1) = 2^{1-n} \frac{d^k}{dz^k}(1 - z)^n \Big|_{z=1} = 2^{1-n} (-1)^k \frac{n!}{(n - k)!} (1 - z)^{n-k} \Big|_{z=1} = 0.$$

This completes the proof. □
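The moment identity in Proposition B.2 can be verified exactly with rational arithmetic. This is a small sketch of the check for one value of $n$; it does not verify membership in $\mathcal{S}_{n/2,3}$, only the disjoint supports and matching moments, and the helper names are hypothetical.

```python
from fractions import Fraction
from math import comb
from typing import Dict, Tuple


def parity_split(n: int) -> Tuple[Dict[int, Fraction], Dict[int, Fraction]]:
    """Return P on even integers and Q on odd integers, each with mass 2^(1-n) * C(n, m)."""
    scale = Fraction(2, 2**n)
    P = {m: scale * comb(n, m) for m in range(0, n + 1, 2)}
    Q = {m: scale * comb(n, m) for m in range(1, n + 1, 2)}
    return P, Q


def moment(dist: Dict[int, Fraction], t: int) -> Fraction:
    """Exact t-th raw moment of a finitely supported distribution."""
    return sum((mass * m**t for m, mass in dist.items()), Fraction(0))


if __name__ == "__main__":
    n = 6
    P, Q = parity_split(n)
    for t in range(1, n + 1):
        print(t, moment(P, t) - moment(Q, t))
```

The differences are exactly zero for $t = 1, \ldots, n - 1$ and non-zero for $t = n$, which shows that $n - 1$ is the most that can be matched by this pair.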


C Omitted Proofs from Section 2

C.1 Bootstrapping Our Sampler

The running time of the sampler described in Section 2.3
has an $O(\log n)$ dependence. In this subsection, we show that the dependence on $n$ can be easily
removed, by dealing separately with the case that the variance is $\Omega(\mathrm{poly}(k/\epsilon))$. In particular, we
have the following algorithm, which is similar to the Learn-Heavy routine from [DDO+13].

Lemma C.1. There is an algorithm with the following performance guarantee: For any $\epsilon > 0$
and $X \in \mathcal{S}_{n,k}$ with $\mathrm{Var}[X] = \Omega(\mathrm{poly}(k/\epsilon))$, the algorithm draws $O(k/\epsilon^2)$ samples from $X$, runs in
$O(k^2/\epsilon^2)$ time, and with high constant probability outputs a distribution $cZ + Y$, where $1 \le c \le k$,
$Z$ is a discrete Gaussian, and $Y$ is a $c$-IRV, with $d_{\mathrm{TV}}(X, cZ + Y) \le \epsilon$.

Proof. By Theorem 3.1, there is a $1 \le c' \le k$ such that the discrete Gaussian $Z'$ with parameters
$\mathbb{E}[X]/c'$ and $\mathrm{Var}[X]/c'^2$ and the $c'$-IRV $Y' := X \pmod{c'}$ satisfy $d_{\mathrm{TV}}(X, c'Z' + Y') \le \epsilon$.

We start by guessing $c$. For each guess for $c$, we learn the appropriate $Y$ and $Z$. Finally, we
run a tournament over the possible values of $c$. Fix $1 \le c \le k$. To learn $Y$, we first draw $\Theta(c/\epsilon^2)$
samples and let $X'$ be the resulting empirical distribution. Then, we take $Y = X' \pmod{c}$. To
learn $Z$, we take $O(1/\epsilon^2)$ samples from $X$ and calculate the empirical mean and variance, $\hat{\mu}$ and
$\hat{\sigma}^2$. Then, we let $Z$ be the distribution obtained by sampling from $N(\hat{\mu}/c, \hat{\sigma}^2/c^2)$ and rounding the
sample to the nearest integer.

Suppose that $c = c'$. By standard facts, we have $d_{\mathrm{TV}}(Y, Y') = d_{\mathrm{TV}}(X' \pmod{c}, X \pmod{c}) \le
\epsilon/4$ with high probability. Also, with high probability, we have $(1 - \epsilon/4)\hat{\sigma}^2 \le \mathrm{Var}[X] \le (1 + \epsilon/4)\hat{\sigma}^2$
and $|\mathbb{E}[X] - \hat{\mu}| \le \hat{\sigma}\epsilon/4$. By a combination of Propositions A.2 and A.4 (applying Proposition A.4
to the two Gaussians before rounding, and Proposition A.2 to the rounding step), we have that
$d_{\mathrm{TV}}(Z, Z') \le \epsilon/4$. Thus, we have $d_{\mathrm{TV}}(Y + cZ, Y' + cZ') \le d_{\mathrm{TV}}(Y, Y') +
d_{\mathrm{TV}}(Z, Z') \le \epsilon/2$, and therefore $d_{\mathrm{TV}}(X, Y + cZ) \le d_{\mathrm{TV}}(X, Y' + cZ') + d_{\mathrm{TV}}(Y + cZ, Y' + cZ') \le \epsilon$.

In summary, we have $k$ different hypothesis distributions $Y_c + cZ_c$, for each $1 \le c \le k$, one
of which is promised to satisfy $d_{\mathrm{TV}}(X, Y_c + cZ_c) \le \epsilon$. We can now run a standard tournament
procedure [DL01, DDS15] that produces a hypothesis with $d_{\mathrm{TV}}(X, Y_c + cZ_c) \le O(\epsilon)$ with high
probability. This requires $O(\log k/\epsilon^2)$ samples and can be easily done in $O(k^2/\epsilon^2)$ time. □

We thus obtain the following corollary: 

Corollary C.2. For all $n, k \in \mathbb{Z}_+$ and $\epsilon > 0$, there is an algorithm with the following performance
guarantee: Let $X \in \mathcal{S}_{n,k}$ be an unknown k-SIIRV. The algorithm uses $O(k \log^2(k/\epsilon)/\epsilon^2)$ samples
from $X$, runs in time $O(k^3/\epsilon^2)$, and with probability at least 9/10 outputs an $\epsilon$-sampler for $X$. This
$\epsilon$-sampler produces a single sample in time $O(k)$.

Proof. First we take $O(1)$ samples and estimate the variance of $X$. If the variance is $\Omega(\mathrm{poly}(k/\epsilon))$,
we use the algorithm given by Lemma C.1 to output a distribution $cZ + Y$, where $1 \le c \le k$, $Z$ is
a discrete Gaussian and $Y$ is a $c$-IRV, with $d_{\mathrm{TV}}(X, cZ + Y) \le \epsilon$. Note that $cZ + Y$ can be sampled
in time $O(k)$.

If the variance is $O(\mathrm{poly}(k/\epsilon))$, we use Learn-SIIRV. This produces a distribution $H$ given by
its DFT modulo $M = O(\mathrm{poly}(k/\epsilon))$ at $O(k \log(k/\epsilon))$ points. By Theorem 2.6, we can compute an
$\epsilon$-sampler which produces a single sample in time

$$O(\log(M) \log(M/\epsilon) \cdot |S|) = O(\log^2(k/\epsilon) \cdot k \log(k/\epsilon)).$$

□


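C.2 A Bound on the 1/2-norm of k-SIIRVs

Lemma C.3. The 1/2-norm of a k-SIIRV $P$ with variance $\sigma^2$ is $O(\sigma + k)$.

Proof. Recall that $\|P\|_{1/2} = (\sum_i \sqrt{P(i)})^2$. Let $\mu$ be the mean of $X \sim P$. By Cauchy-Schwarz, for
any $S \subseteq [kn]$, we have $\sum_{i \in S} \sqrt{P(i)} \le \sqrt{P(S) \cdot |S|}$. By Bernstein's inequality, for any $\epsilon > 0$, it
holds that $\Pr[|X - \mu| \ge (k + \sigma)\log(1/\epsilon)] \le \epsilon$. Therefore, we can write

$$\sum_i \sqrt{P(i)} \le \sum_{i : |i - \mu| \le \sigma + k} \sqrt{P(i)} + \sum_{m=0}^{\infty} \ \sum_{i : (\sigma + k)2^m < |i - \mu| \le 2^{m+1}(\sigma + k)} \sqrt{P(i)}$$

$$\le O(\sqrt{\sigma + k}) + \sum_{m=0}^{\infty} 2\sqrt{\sigma + k} \cdot 2^{m/2} \sqrt{\Pr[|X - \mu| > (\sigma + k)2^m]}$$

$$\le O(\sqrt{\sigma + k}) + 2\sqrt{\sigma + k} \sum_{m=0}^{\infty} 2^{m/2} e^{-2^{m-1}} = O(\sqrt{\sigma + k}).$$

Squaring both sides gives $\|P\|_{1/2} = O(\sigma + k)$. □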
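As a quick illustration of Lemma C.3, the 1/2-norm can be computed directly for a concrete 2-SIIRV. The sketch below uses a Binomial distribution; the helper names are hypothetical, and the comparison constants in the usage note are heuristic rather than taken from the paper.

```python
import math
from typing import List


def binomial_pmf(n: int, p: float) -> List[float]:
    """PMF of Binomial(n, p), which is a 2-SIIRV of order n."""
    return [math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(n + 1)]


def half_norm(pmf: List[float]) -> float:
    """The 1/2-norm ||P||_{1/2} = (sum_i sqrt(P(i)))^2."""
    return sum(math.sqrt(mass) for mass in pmf) ** 2


if __name__ == "__main__":
    n, p, k = 400, 0.5, 2
    sigma = math.sqrt(n * p * (1.0 - p))
    print(half_norm(binomial_pmf(n, p)), sigma + k)
```

For a distribution that is close to a Gaussian with standard deviation $\sigma$, one expects a value near $2\sqrt{2\pi}\,\sigma \approx 5\sigma$, consistent with the $O(\sigma + k)$ bound.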
D Omitted Proofs from Section 3

D.1 Proof of Lemma 3.2

For convenience, we restate Lemma 3.2.

Lemma 3.2. Let $P \in \mathcal{S}_{n,k}$ be a k-SIIRV with $\mathrm{Var}[X] = V$ for $X \sim P$. For any $0 < \delta < 1/4$, there exists
$Q \in \mathcal{S}_{n,k}$ with $d_{\mathrm{TV}}(P, Q) = O(\delta V)$ such that all but $O(k + V/\delta)$ of the k-IRVs defining $Q$ are
constant.

Proof. For a k-IRV $A$ let $m(A)$ be an index $i$ so that $\Pr[A = i]$ is maximized. Let $d(A) = \Pr[A \neq
m(A)]$ be the probability $A$ assigns to values in $[k] \setminus \{m(A)\}$. Suppose that $d(A) \le 1/2$. Then we have
that

$$d(A)/2 \le (1/2) \cdot \Pr(A \neq A') \le (1/2) \cdot \mathbb{E}[|A - A'|^2] = \mathrm{Var}[A] \le \mathbb{E}[|A - m(A)|^2] \le k^2 \cdot d(A),$$

where $A'$ is an independent copy of $A$. The leftmost inequality follows from our assumption that
$d(A) \le 1/2$. The proof of the lemma will make repeated applications of the following claim:

Claim D.1. Let $A, B$ be independent k-IRVs with $m(A) = m(B)$ and $d(A) + d(B) \le 1/2$. Then
there exist independent k-IRVs $C$ and $D$, where $D$ is a constant, $d(C) = d(A) + d(B)$, and
$d_{\mathrm{TV}}(A + B, C + D) = O(d(A)d(B))$.

Proof. Let m(A) = m(B ) = i. Let d(A) = 8 ±,d(B) = 62 ■ Let A' be the random variable A 
conditioned on A not equaling i, and B' be the random variable B conditioned on it not equaling 
i. Note that A is a mixture of i and A! and B a mixture of i and B'. Furthermore A + B equals 2 i 
with probability (1 — di)(l — 82 ), i + A' with probability <5i(l — 82 ), i + B' with probability (1 — 81)82 
and A 1 + B' with probability <5 i<52- 

Let D be the random variable that is deterministically i and C be the random variable that 
equals i with probability 1 — <5i — 82 , A! with probability <5i, and B' with probability 82 - Then 
C + D equals 2i, i + A', i + B' and A! + B' with probabilities 1 — 8 \ — 82 , 81 , 82 , and 0. These 
probabilities are within an additive ^ 1^2 of the corresponding probabilities for A + B and therefore 
dTv(A + B,C + D) = 0 ( 8182 ). Note that C = i with probability 1 — <5i — 82 , so d(C) = di + 82 , 
which completes the proof. □ 
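The merge step of Claim D.1 can be checked on a small instance. The sketch below builds $C$ and $D$ from two explicit 3-IRVs with a common mode and compares $A + B$ with $C + D$ in total variation distance, using exact rational arithmetic. The function names and the particular pmfs are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction

def convolve(p, q):
    """pmf of the sum of two independent variables (lists indexed by value)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            out[a + b] += pa * qb
    return out

def mode_and_d(p):
    """Return (m(A), d(A)): a most likely value and the mass outside it."""
    m = max(range(len(p)), key=lambda v: p[v])
    return m, 1 - p[m]

def merge(p_a, p_b):
    """Build the pmfs of C and D from Claim D.1 for A, B with a common mode."""
    i, d1 = mode_and_d(p_a)
    j, d2 = mode_and_d(p_b)
    assert i == j and d1 + d2 <= Fraction(1, 2)
    # For v != i: Pr[C = v] = d1*Pr[A' = v] + d2*Pr[B' = v] = Pr[A = v] + Pr[B = v].
    p_c = [pa + pb for pa, pb in zip(p_a, p_b)]
    p_c[i] = 1 - d1 - d2
    p_d = [Fraction(int(v == i)) for v in range(len(p_a))]  # D is constant i
    return p_c, p_d

def tv(p, q):
    """Total variation distance between two pmfs on the same support."""
    return sum(abs(x - y) for x, y in zip(p, q)) / 2

F = Fraction
p_a = [F(8, 10), F(1, 10), F(1, 10)]   # d(A) = 1/5
p_b = [F(7, 10), F(2, 10), F(1, 10)]   # d(B) = 3/10
p_c, p_d = merge(p_a, p_b)
dist = tv(convolve(p_a, p_b), convolve(p_c, p_d))
```

In this instance the distance is $1/10$, which is below $2\,d(A)\,d(B) = 3/25$; the factor 2 comes from the four mixture weights each differing by $\delta_1\delta_2$.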

For a random variable $X \sim P$, we have that $X = \sum_{i=1}^n A_i$ where the $A_i$'s are independent k-IRVs. We iteratively modify $P$ as follows: If two of the non-constant component k-IRVs of $P$ are $A$ and $B$, with $m(A) = m(B)$ and $d(A), d(B) \le \delta$, then we replace the pair $A$ and $B$ with the pair $C$ and $D$ as described by the above claim. Notice that every step reduces the number of non-constant component variables, and therefore this process terminates, giving a k-SIIRV $Q$ with $Y = \sum_{i=1}^n B_i$ for $Y \sim Q$.

By construction, for each $1 \le i \le k$, $Q$ has at most one non-constant component variable with $m(B_j) = i$ and $d(B_j) \le \delta$. Claim D.1 implies the sum of the $d$'s of the component variables does not increase in any iteration, and therefore

$$\sum_{j=1}^n d(B_j) \le \sum_{j=1}^n d(A_j) \le 2\sum_{j=1}^n \mathrm{Var}[A_j] = 2\,\mathrm{Var}[X] = 2V,$$

where the second inequality uses the aforementioned lower bound on the variance of a k-IRV. Hence, the number of non-constant component variables in $Q$ is at most $k + 2V\delta^{-1}$.

It remains to show that $d_{TV}(P, Q) = O(\delta V)$. Let $A, B$ and $C, D$ be the k-IRVs of Claim D.1. Then $d_{TV}(A + B, C + D) = O(d(A)d(B)) = O([d(C)^2 + d(D)^2] - [d(A)^2 + d(B)^2])$. That is, the total variation distance error introduced by replacing $A, B$ by $C, D$ is at most a constant times the amount that the sum of the squares of the $d$'s of the component variables increases by. Repeated application of this observation combined with the sub-additivity of total variation distance gives $d_{TV}(P, Q) = O\left(\sum_{j=1}^n d(B_j)^2 - \sum_{j=1}^n d(A_j)^2\right)$. On the other hand, note that all of the $B_j$'s that are not also $A_j$'s satisfy $d(B_j) \le 2\delta$. Therefore, we have that

$$d_{TV}(P, Q) \le O\left(\sum_{j : d(B_j) \le 2\delta} d(B_j)^2\right) = O\left(\delta \sum_{j} d(B_j)\right) = O(\delta V),$$

which completes the proof. □
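The counting part of this proof depends only on the pair $(m(A), d(A))$ of each component, so the iteration can be simulated by bookkeeping alone. The sketch below is an illustration under that simplification: it repeatedly merges two components with the same mode and $d \le \delta$, then exposes the quantities used in the proof. The name `sparsify` and the synthetic input are assumptions made for the example.

```python
from fractions import Fraction

def sparsify(components, delta):
    """components: list of (mode, d) pairs. Apply the merge of Claim D.1
    until no two non-constant components share a mode with d <= delta."""
    comps = list(components)
    while True:
        seen = {}      # mode -> index of a non-constant component with d <= delta
        pair = None
        for idx, (mode, d) in enumerate(comps):
            if 0 < d <= delta:
                if mode in seen:
                    pair = (seen[mode], idx)
                    break
                seen[mode] = idx
        if pair is None:
            return comps
        a, b = pair
        mode = comps[a][0]
        comps[a] = (mode, comps[a][1] + comps[b][1])  # C with d(C) = d(A) + d(B)
        comps[b] = (mode, Fraction(0))                # D is constant

k, delta = 3, Fraction(1, 10)
before = [(j % k, Fraction(1, 100 + j)) for j in range(300)]
after = sparsify(before, delta)

total_d = sum(d for _, d in before)
non_constant = sum(1 for _, d in after if d > 0)
# Each merge raises the sum of squares by exactly 2 d(A) d(B).
potential_gain = sum(d * d for _, d in after) - sum(d * d for _, d in before)
```

Here `non_constant` is at most `k + total_d / delta`, and `potential_gain`, which upper bounds the accumulated total variation cost up to a constant, is at most `2 * delta * total_d`.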

D.2 Proof of Lemma 3.5. For convenience, we restate Lemma 3.5:

Lemma 3.5. Fix $x \in \mathbb{C}$ with $|x| = 1$. Suppose that $\rho_1, \ldots, \rho_m$ are roots of $P(x)$ (listed with appropriate multiplicity) which have $|\rho_i - x| \le \frac{1}{2k}$. Then, we have the following:

(i) $|P(x)| \le 2^{-m}$.

(ii) For the polynomial $q(x) = P(x)/\prod_{i=1}^m (x - \rho_i)$, we have that $|q(x)| \le k^m$.

To prove our lemma, we will make essential use of the following simple lemma: 

Lemma D.2. For any polynomial $p(x) \in \mathbb{C}[x]$ of degree $d$ where the sum of the absolute values of the coefficients of $p$ is at most 1, we have the following: Fix $z \in \mathbb{C}$ with $|z| = 1$. Suppose that $p$ has roots $\rho_1, \ldots, \rho_m$ with $|\rho_i - z| \le \frac{1}{2d}$ for $i \in \{1, \ldots, m\}$. Then, the following hold:

(i) $|p(z)| \le 2^{-m}$,

(ii) for the polynomial $q(x) = p(x)/\prod_{i=1}^m (x - \rho_i)$, we have that $|q(z)| \le d^m$.

Proof. The lemma is proved by repeated applications of the following claim: 

Claim D.3. Let $p(x) \in \mathbb{C}[x]$ be a degree-$d$ polynomial such that the sum of the absolute values of the coefficients of $p$ is at most 1. Let $\rho$ be a root of $p(x)$ and $q(x)$ be the polynomial $p(x)/(x - \rho)$. Then, the sum of the absolute values of the coefficients of $q$ is at most $d$.

Proof. We write the coefficients of $p(x)$ and $q(x)$ as $p(x) = \sum_{i=0}^{d} p_i x^i$ and $q(x) = \sum_{i=0}^{d-1} q_i x^i$. Since $p(x) = (x - \rho)q(x)$, for $1 \le i \le d - 1$, we have

$$p_i = q_{i-1} - \rho q_i, \qquad (10)$$

and similarly $p_d = q_{d-1}$, $p_0 = -\rho q_0$.

We consider two cases based on the magnitude of $\rho$. First, suppose that $|\rho| \le 1$. Since $q_{d-1} = p_d$ and, by (10), $q_{i-1} = p_i + \rho q_i$ for $1 \le i \le d - 1$, an easy induction gives that $q_i = \sum_{j=i+1}^{d} p_j \rho^{j-i-1}$ for $0 \le i \le d - 1$. Summing and taking absolute values gives:

$$\sum_{i=0}^{d-1} |q_i| \le \sum_{i=0}^{d-1} \sum_{j=i+1}^{d} |p_j| |\rho|^{j-i-1} = \sum_{i=1}^{d} |p_i| \sum_{j=0}^{i-1} |\rho|^{j} \le \sum_{i=1}^{d} |p_i| \cdot i \le d \sum_{i=1}^{d} |p_i| \le d.$$

Second, suppose $|\rho| > 1$. Then, $\frac{1}{|\rho|} < 1$. We have $q_0 = -\frac{1}{\rho} p_0$ and by (10), for $1 \le i \le d - 1$, $q_i = \frac{1}{\rho}(q_{i-1} - p_i)$. By an easy induction, for $0 \le i < d$, $q_i = -\sum_{j=0}^{i} p_j \rho^{j-i-1}$. Summing and taking absolute values gives:

$$\sum_{i=0}^{d-1} |q_i| \le \sum_{i=0}^{d-1} \sum_{j=0}^{i} |p_j| |\rho|^{j-i-1} = \sum_{i=0}^{d-1} |p_i| \sum_{j=1}^{d-i} |\rho|^{-j} \le \sum_{i=0}^{d-1} |p_i| (d - i) \le d \sum_{i=0}^{d-1} |p_i| \le d.$$

□

By repeated applications of the claim it follows that the polynomial $q(x)$ has the sum of the absolute values of its coefficients at most $d^m$. Since $|z| = 1$, it follows that $|q(z)| \le d^m$, which gives (ii). To show (i) we note that

$$|p(z)| = |q(z)| \cdot \prod_{i=1}^{m} |z - \rho_i| \le |q(z)| \cdot \left(\frac{1}{2d}\right)^m \le 2^{-m}.$$

This completes the proof of Lemma D.2. □
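The recurrence $q_{i-1} = p_i + \rho q_i$ used in the proof of Claim D.3 is ordinary synthetic division, so both parts of Lemma D.2 can be checked numerically on a small example. The sketch below is illustrative only: the cubic, its roots, and all function names are assumptions chosen for the example.

```python
def poly_from_roots(roots):
    """Coefficients (lowest degree first) of prod (x - r) over the given roots."""
    coeffs = [1 + 0j]
    for r in roots:
        shifted = [0j] + coeffs                    # x * current polynomial
        scaled = [-r * c for c in coeffs] + [0j]   # -r * current polynomial
        coeffs = [s + t for s, t in zip(shifted, scaled)]
    return coeffs

def divide_by_root(p, rho):
    """Return q with p(x) = (x - rho) q(x), via q_{d-1} = p_d, q_{i-1} = p_i + rho*q_i."""
    d = len(p) - 1
    q = [0j] * d
    q[d - 1] = p[d]
    for i in range(d - 1, 0, -1):
        q[i - 1] = p[i] + rho * q[i]
    return q

def l1(p):
    """Sum of the absolute values of the coefficients."""
    return sum(abs(c) for c in p)

def evaluate(p, z):
    return sum(c * z ** i for i, c in enumerate(p))

z = 1 + 0j
close_roots = [1.1 + 0j, 1 + 0.1j]   # both within 1/(2d) = 1/6 of z
far_roots = [-2 + 0j]
p = poly_from_roots(close_roots + far_roots)
scale = l1(p)
p = [c / scale for c in p]           # normalize: coefficient l1-norm equals 1
d, m = len(p) - 1, len(close_roots)

q = p
for rho in close_roots:              # divide out the m roots close to z
    q = divide_by_root(q, rho)
```

With these inputs, `l1(q)` and `abs(evaluate(q, z))` are at most `d ** m`, and `abs(evaluate(p, z))` is at most `2 ** (-m)`, matching parts (ii) and (i).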

Proof of Lemma 3.5. Note that $P(x)$ is the degree $n(k-1)$ polynomial defined by $P(x) = \sum_{i=0}^{n(k-1)} P(i) x^i$. Note that the sum of the absolute values of $P$'s coefficients is 1. However, to apply Lemma D.2 directly to $P$ we would need the roots to be at distance at most $\frac{1}{2n(k-1)}$.

Note that $P(x)$ factors as $\prod_{i=1}^{n} P_i(x)$, where $P_i(x) = E[x^{X_i}]$ is a degree $k - 1$ polynomial that is determined by the $i$-th k-IRV. It is clear that the coefficients of $P_i(x)$ are non-negative and sum to 1, hence we may apply Lemma D.2 to $P_i(x)$. Suppose that $P_i(x)$ has $m_i$ roots $\rho_j$ with $|\rho_j - x| \le \frac{1}{2k}$. Lemma D.2 (i) implies that $|P_i(x)| \le 2^{-m_i}$. Since $P(x) = \prod_{i=1}^{n} P_i(x)$, this yields part (i) of Lemma 3.5.

Lemma D.2 (ii) implies that the polynomial $q_i(x) = P_i(x)/\prod_{j \in S_i}(x - \rho_j)$, for $S_i \subseteq \{1, \ldots, m\}$ with $|S_i| = m_i$, satisfies $|q_i(x)| \le k^{m_i}$. Note that $q(x) = \prod_{i=1}^{n} q_i(x)$. Therefore, $|q(x)| \le \prod_{i=1}^{n} k^{m_i} = k^m$, giving part (ii) of Lemma 3.5. □


D.3 Proper Cover Construction for the High Variance Case. Exhausting over the $k - 1$ possible values of $c$, we can assume that $c$ is known to the algorithm. Before proceeding further, we will need further structural information about the k-SIIRVs in this case. We start with the following simple lemma giving an upper bound on the total variation distance between two high variance k-SIIRVs:


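Lemma D.4. For $\epsilon > 0$, let $X, X'$ be k-SIIRVs with $\mathrm{Var}[X], \mathrm{Var}[X'] \ge \mathrm{poly}(k/\epsilon)$ for a sufficiently large $\mathrm{poly}(k/\epsilon)$ that have $d_{TV}(X, Y + cZ) \le \epsilon$ and $d_{TV}(X', Y' + cZ') \le \epsilon$ for c-IRVs $Y, Y'$ and discrete Gaussians $Z, Z'$, with $E[X] = cE[Z]$, $\mathrm{Var}[X] = c^2\mathrm{Var}[Z]$, $E[X'] = cE[Z']$ and $\mathrm{Var}[X'] = c^2\mathrm{Var}[Z']$. Then we have that

$$d_{TV}(X, X') \le 4\epsilon + d_{TV}(X \bmod c, X' \bmod c) + \frac{1}{2} \frac{|E[X] - E[X']|}{\sqrt{\mathrm{Var}[X]}} + \frac{1}{2} \frac{|\mathrm{Var}[X] - \mathrm{Var}[X']|}{\mathrm{Var}[X]},$$

where $X \pmod c$ is the c-IRV with $\Pr[X \pmod c = i] = \Pr[X \equiv i \pmod c]$ for $i \in [c]$.

Proof. Using Proposition A.2, since $d_{TV}(X, Y + cZ) \le \epsilon$ with $Y = Y + cZ \pmod c$, we have $d_{TV}(X \pmod c, Y) \le \epsilon$. Similarly, $d_{TV}(X' \pmod c, Y') \le \epsilon$. By a combination of Propositions A.2 and A.3 we have that $d_{TV}(Z, Z') \le \frac{1}{2}\left(\frac{|E[Z] - E[Z']|}{\sqrt{\mathrm{Var}[Z]}} + \frac{|\mathrm{Var}[Z] - \mathrm{Var}[Z']|}{\mathrm{Var}[Z]}\right)$. Since $E[X] = cE[Z]$, $\mathrm{Var}[X] = c^2\mathrm{Var}[Z]$, $E[X'] = cE[Z']$ and $\mathrm{Var}[X'] = c^2\mathrm{Var}[Z']$ it follows that

$$\frac{|E[Z] - E[Z']|}{\sqrt{\mathrm{Var}[Z]}} + \frac{|\mathrm{Var}[Z] - \mathrm{Var}[Z']|}{\mathrm{Var}[Z]} = \frac{|E[X] - E[X']|}{\sqrt{\mathrm{Var}[X]}} + \frac{|\mathrm{Var}[X] - \mathrm{Var}[X']|}{\mathrm{Var}[X]}.$$

Therefore,

$$\begin{aligned}
d_{TV}(Y + cZ, Y' + cZ') &\le d_{TV}(Y, Y') + d_{TV}(Z, Z') \\
&\le 2\epsilon + d_{TV}(X \bmod c, X' \bmod c) + \frac{1}{2}\left(\frac{|E[X] - E[X']|}{\sqrt{\mathrm{Var}[X]}} + \frac{|\mathrm{Var}[X] - \mathrm{Var}[X']|}{\mathrm{Var}[X]}\right).
\end{aligned}$$

By another application of the triangle inequality, we have that $d_{TV}(X, X') \le d_{TV}(X, Y + cZ) + d_{TV}(Y + cZ, Y' + cZ') + d_{TV}(Y' + cZ', X') \le 2\epsilon + d_{TV}(Y + cZ, Y' + cZ')$, which completes the proof. □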
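To make the right-hand side of Lemma D.4 concrete, the sketch below evaluates it for two explicit pmfs. It only computes the expression; it does not verify the lemma's hypotheses (high variance and closeness to $Y + cZ$), which the toy inputs do not satisfy. All names, including `lemma_d4_bound`, are illustrative assumptions.

```python
import math

def moments(pmf):
    """Mean and variance of a pmf given as a list indexed by value."""
    mean = sum(v * p for v, p in enumerate(pmf))
    var = sum((v - mean) ** 2 * p for v, p in enumerate(pmf))
    return mean, var

def reduce_mod(pmf, c):
    """pmf of X (mod c): Pr[X mod c = i] = Pr[X = i (mod c)]."""
    out = [0.0] * c
    for v, p in enumerate(pmf):
        out[v % c] += p
    return out

def tv(p, q):
    """Total variation distance; the shorter pmf is padded with zeros."""
    n = max(len(p), len(q))
    p = list(p) + [0.0] * (n - len(p))
    q = list(q) + [0.0] * (n - len(q))
    return sum(abs(a - b) for a, b in zip(p, q)) / 2

def lemma_d4_bound(pmf_x, pmf_x2, c, eps):
    """Right-hand side of Lemma D.4 for X ~ pmf_x and X' ~ pmf_x2."""
    mean_x, var_x = moments(pmf_x)
    mean_x2, var_x2 = moments(pmf_x2)
    return (4 * eps
            + tv(reduce_mod(pmf_x, c), reduce_mod(pmf_x2, c))
            + 0.5 * abs(mean_x - mean_x2) / math.sqrt(var_x)
            + 0.5 * abs(var_x - var_x2) / var_x)

pmf_x = [0.25, 0.0, 0.5, 0.0, 0.25]               # 2 * Binomial(2, 1/2)
pmf_x2 = [0.0, 0.0, 0.25, 0.0, 0.5, 0.0, 0.25]    # the same variable shifted by 2
bound = lemma_d4_bound(pmf_x, pmf_x2, c=2, eps=0.0)
```

In this example the two variables agree modulo $c = 2$ and have equal variance, so only the mean term contributes to `bound`.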

To use the above lemma, we need a way to characterize the constant $c$ in the statement of Theorem 3.1, namely to show that the theorem applies to both $X$ and $X'$ for the same value of $c$. For a k-IRV $A$, let $m(A)$ be an index $i$ so that $\Pr[A = i]$ is maximized. The following result is implicit in the proof of Theorem 3.1 in [DDO+13] (in particular, in Theorem 4.3 of that paper):

Lemma D.5 ([DDO+13]). Given a k-SIIRV $X = \sum_{i=1}^{n} X_i$ with $\mathrm{Var}[X] \ge \mathrm{poly}(k/\epsilon)$, let $\mathcal{H}$ be the set of integers $b$ such that $\sum_{i=1}^{n} \Pr[X_i - m(X_i) = b] \ge \Omega(k^7/\epsilon^2)$ and $c = \gcd(\mathcal{H})$. Then there is a c-IRV $Y$ and a discrete Gaussian $Z$ with $d_{TV}(X, Y + cZ) \le \epsilon$.
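The definition of $c$ in Lemma D.5 is directly computable from the component pmfs. The sketch below is an illustration under one assumption: the constant hidden in the $\Omega(k^7/\epsilon^2)$ threshold is not specified in the text, so the threshold is passed in as a hypothetical parameter `threshold`. The function names are also illustrative.

```python
from functools import reduce
from math import gcd

def heavy_differences(pmfs, threshold):
    """Set of integers b with sum_i Pr[X_i - m(X_i) = b] >= threshold."""
    weight = {}
    for pmf in pmfs:
        mode = max(range(len(pmf)), key=lambda v: pmf[v])   # m(X_i)
        for v, p in enumerate(pmf):
            if p > 0:
                weight[v - mode] = weight.get(v - mode, 0.0) + p
    return {b for b, w in weight.items() if w >= threshold}

def common_step(pmfs, threshold):
    """c = gcd of the heavy differences; b = 0 may appear but does not affect the gcd."""
    return reduce(gcd, heavy_differences(pmfs, threshold), 0)

even = [1 / 3, 0.0, 1 / 3, 0.0, 1 / 3]   # uniform on {0, 2, 4}
odd = [0.9, 0.1, 0.0, 0.0, 0.0]          # rarely takes the value 1
pmfs = [even] * 300 + [odd] * 3
c = common_step(pmfs, threshold=10.0)
```

With this input the differences 2 and 4 are heavy while the difference 1 carries total mass only about 0.3, so the computed value is `c = 2`; lowering the threshold below that mass makes 1 heavy and drops the gcd to 1.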

Let $X \in \mathcal{S}_{n,k}$ be a k-SIIRV with $\mathrm{Var}[X] \ge \mathrm{poly}(k/\epsilon)$ as in Case 2 of Theorem 3.1. Our main claim is that, up to $\epsilon$ error in total variation distance, we can assume that $X$ has a special structure. In particular, we can take all but one of the component IRVs of $X$ to be constant modulo $c$, with the last one being a c-IRV. More formally, we claim that there is a k-SIIRV $X'$ with $d_{TV}(X, X') \le \epsilon$, such that $X' = \sum_{i=1}^{n} X'_i$ with

• For $1 \le i \le H$, where $H = \Omega(k^7/\epsilon^2)$, $X'_i$ is either 0 or $c$ each with equal probability.

• For $1 \le i \le n - 1$, $X'_i$ is constant modulo $c$.

• $X'_n$ is a c-IRV.


where $c$ is as in Lemma D.5.

We can construct such an $X'$ from $X$ as follows. For $1 \le i \le H$, we replace $X_i$ with the $X'_i$ above that is 0 or $c$ with equal probability. For $H + 1 \le i \le n - 1$, we replace each $X_i$ by $X_i$ conditioned on the event that $X_i \pmod c = m(X_i) \pmod c$. Finally we take $X'_n$ to be $\left(X - \sum_{i=1}^{n-1} X'_i\right) \pmod c$, noting that $\sum_{i=1}^{n-1} X'_i \pmod c$ is a constant.

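We now show that the above procedure only changes the expectation and variance by $|E[X] - E[X']| \le \mathrm{poly}(k/\epsilon)$ and $|\mathrm{Var}[X] - \mathrm{Var}[X']| \le \mathrm{poly}(k/\epsilon)$. Note that for two arbitrary k-IRVs $A$ and $B$, we have that $|E[A] - E[B]| \le k$ and $|\mathrm{Var}[A] - \mathrm{Var}[B]| \le k^2$. Thus,

$$\left|E\left[X_n + \sum_{i=1}^{H} X_i\right] - E\left[X'_n + \sum_{i=1}^{H} X'_i\right]\right| \le (H + 1)k \le \mathrm{poly}(k/\epsilon)$$

and

$$\left|\mathrm{Var}\left[X_n + \sum_{i=1}^{H} X_i\right] - \mathrm{Var}\left[X'_n + \sum_{i=1}^{H} X'_i\right]\right| \le (H + 1)k^2 \le \mathrm{poly}(k/\epsilon).$$

For the remaining variables $H + 1 \le i \le n - 1$, we have $d_{TV}(X_i, X'_i) \le \Pr[X_i - m(X_i) \not\equiv 0 \pmod c]$ and so $|E[X_i] - E[X'_i]| \le k \Pr[X_i - m(X_i) \not\equiv 0 \pmod c]$ and $|\mathrm{Var}[X_i] - \mathrm{Var}[X'_i]| \le k^2 \Pr[X_i - m(X_i) \not\equiv 0 \pmod c]$. For each integer $0 < b \le k - 1$ that is not a multiple of $c$, by Lemma D.5 we must have that $b \notin \mathcal{H}$ (since $c = \gcd(\mathcal{H})$ divides every element of $\mathcal{H}$) and hence $\sum_{i=1}^{n} \Pr[X_i - m(X_i) = b] = O(k^7/\epsilon^2)$. Thus, $\sum_{i=1}^{n} \Pr[X_i - m(X_i) \not\equiv 0 \pmod c] = O(k^8/\epsilon^2)$.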

If $\mathrm{Var}[X]$ is a sufficiently large $\mathrm{poly}(k/\epsilon)$, then $\mathrm{Var}[X']$ is large enough that we can apply Theorem 3.1 and Lemma D.5 to $X'$. Note that $\sum_{i=1}^{n} \Pr[|X'_i - m(X'_i)| = c] \ge \sum_{i=1}^{H} \Pr[|X'_i - m(X'_i)| = c] = H/2$. We thus have that either $c \in \mathcal{H}$ or $-c \in \mathcal{H}$. Since for $b$ that is not a multiple of $c$, we have $\sum_{i=1}^{n} \Pr[X'_i - m(X'_i) = b] = \Pr[X'_n - m(X'_n) = b] \le 1$ and thus $b \notin \mathcal{H}$, we have that $\gcd(\mathcal{H}) = c$. Thus, for $X$ with sufficiently large $\mathrm{poly}(k/\epsilon)$ variance, we have that $d_{TV}(X, Y + cZ) \le \epsilon/10$ and $d_{TV}(X', Y' + cZ') \le \epsilon/10$ for the same $1 \le c \le k - 1$ and c-IRVs $Y, Y'$ and discrete Gaussians $Z, Z'$. In conclusion, we can apply Lemma D.4 to $X$ and $X'$. We have that $X' \pmod c = X'_n = X \pmod c$. We have shown that $|E[X] - E[X']| \le \mathrm{poly}(k/\epsilon)$ and $|\mathrm{Var}[X] - \mathrm{Var}[X']| \le \mathrm{poly}(k/\epsilon)$. If $\mathrm{Var}[X]$ is a sufficiently large $\mathrm{poly}(k/\epsilon)$ then we can make the contributions of each of these to $d_{TV}(X, X')$ in Lemma D.4 smaller than $\epsilon/10$. Then we have $d_{TV}(X, X') \le \epsilon$.

Since every k-SIIRV $X$ in Case 2 is $\epsilon$-close to an $X'$ of the aforementioned form, to compute a proper cover for this case, we can consider only k-SIIRVs of the form stated above. By a similar argument as above, our cover only needs to ensure that the triple of $X \pmod c$, $E[X]$, $\mathrm{Var}[X]$ is sufficiently close to any such triple achievable by an element of $\mathcal{S}_{n,k}$ of this form. Obtaining a cover of $X \pmod c$ is easy, as we only need to deal with the single term $X_n$ that is non-constant modulo $c$, and produce a cover for c-IRVs. Indeed, it is straightforward to produce such a cover of size $O(k/\epsilon)^k$.

As explained in Section 3.1, we have an explicit cover for the discrete Gaussian random variables that can appear in this setting. However, we are left with the difficulty of producing an explicit k-SIIRV approximating one of these $c$ times a discrete Gaussian whenever such an approximation is possible. Fortunately, we note that we only need to be able to approximately match the mean and the variance. Note that as above, the $H = \mathrm{poly}(k/\epsilon)$ components that we are requiring to be either 0 or $c$, and the one that is a c-IRV, can be assumed to have negligible effect on the final mean and variance if we had a sufficiently large $\mathrm{poly}(k/\epsilon)$ threshold for the variance.

Let $C$ be the largest multiple of $c$ that is at most $k - 1$. Let $\mathcal{S}_{n,k,c}$ be the set of k-SIIRVs on $n$ components all of which are constant modulo $c$. For a given $\sigma \ge \mathrm{poly}(k/\epsilon)$ and $\mu$ we need to determine whether or not there is an element of $\mathcal{S}_{n,k,c}$ whose mean and variance match $\mu$ and $\sigma$ to within $\epsilon\sigma$, and if so to produce one. To do this, we first need a couple of observations about which $\mu, \sigma$ are attainable.

Observation D.6. For $P \in \mathcal{S}_{n,k,c}$, $\mathrm{Var}_{X \sim P}[X] \le nC^2/4$.

Proof. This is because any k-IRV that is constant modulo $c$ has a distance of at most $C$ between its minimum and maximum values, and thus has variance at most $C^2/4$. □

Observation D.7. For $P \in \mathcal{S}_{n,k,c}$ and $X \sim P$, if $E[X] \le nC/2$, then $\mathrm{Var}[X] \le C\,E[X] - E[X]^2/n$.

Proof. We note that in the range in question the quantity $C\,E[X] - E[X]^2/n$ is increasing in $E[X]$, and therefore, we may show that for any given achievable variance the minimum possible expectation satisfies this inequality. Note that for the minimum achievable expectation, we may assume that each of the component IRVs is deterministically 0 modulo $c$, since otherwise we could subtract a constant from it, which would decrease the expectation and leave the variance unchanged. The observation now follows given that for any k-IRV $Y$ that has $\Pr[Y \pmod c = 0] = 1$ it holds $\mathrm{Var}[Y] = E[Y^2] - E[Y]^2 \le C\,E[Y] - E[Y]^2$; summing over the $n$ components and using $\sum_i E[Y_i]^2 \ge \left(\sum_i E[Y_i]\right)^2/n$ gives the claimed bound. □


Observation D.8. For $P \in \mathcal{S}_{n,k,c}$ and $X \sim P$, if $E[X] \ge n(k - 1) - nC/2$, then $\mathrm{Var}[X] \le C(n(k - 1) - E[X]) - (n(k - 1) - E[X])^2/n$.

Proof. This follows from the previous observation by considering the random variable $n(k - 1) - X$. □


We now claim that any pair of expectation and variance $\mu$ and $\sigma^2$ not disallowed by the above observations may be approximated by an explicitly computable element of $\mathcal{S}_{n,k,c}$. Note that, by symmetry, we may assume that $\mu \le n(k - 1)/2$. If $\mu \ge 2\sigma^2/C$, we may make $\lfloor 4\sigma^2/C^2 \rfloor \le n$ of our IRVs either $x_i$ or $x_i + C$ with equal probability for some integers $0 \le x_i \le k - 1$ and all other $X_i$ with $H + 1 \le i \le n - 1$ constant. By adjusting the $x_i$'s and the constants, we can make the expectation of $X$ satisfy $|E[X] - \mu| \le 1$ so long as $\mu \ge 2\sigma^2/C$, and, since each such IRV has variance $C^2/4$, the variance $\mathrm{Var}[X] = (C^2/4)\lfloor 4\sigma^2/C^2 \rfloor$ satisfies $|\mathrm{Var}[X] - \sigma^2| \le C^2/4$.

Otherwise, if $\mu < 2\sigma^2/C$, let $\sigma^2 = C\mu \cdot q$ with $1 \ge q \ge 1/2$. We then use a sum of k-IRVs that are 0 with probability $q$ and $C$ with probability $1 - q$, and some k-IRVs that are deterministically 0. If we have $a$ many IRVs of the first type, then we get a mean and variance of $E[X] = a(1 - q)C$ and $\mathrm{Var}[X] = aq(1 - q)C^2$. Letting $a$ be approximately $\sigma^2/(q(1 - q)C^2)$ completes the argument. We simply need to verify that in this case $a \le n$, i.e., that $\sigma^2/(q(1 - q)C^2) \le n$. Indeed, writing $E[X] = \mu$ and $\mathrm{Var}[X] = \sigma^2$ for the target values, note that

$$\frac{\mathrm{Var}[X]}{q(1 - q)C^2} = \frac{\mathrm{Var}[X]}{\left(\mathrm{Var}[X]/(C\,E[X])\right)\left(1 - \mathrm{Var}[X]/(C\,E[X])\right)C^2} = \frac{E[X]^2}{C\,E[X] - \mathrm{Var}[X]} \le n$$

by Observation D.7. This shows that given a discrete Gaussian $Z$ so that $cZ$ approximates some element of $\mathcal{S}_{n,k,c}$, we can efficiently find such an element. In Section 3.1 we gave an appropriately small cover of the set of such Gaussians, which consists of a grid of means and variances of size $O(n)$. It is easy to construct such a grid and by the above, we can construct an $X$ with $|E[X] - c\mu| \le \mathrm{poly}(k/\epsilon)$ and $|\mathrm{Var}[X] - c^2\sigma^2| \le \mathrm{poly}(k/\epsilon)$ for each $\mu, \sigma^2$ in the grid that is not disallowed by our observations. Thus, we can efficiently find a cover of the elements of $\mathcal{S}_{n,k}$ satisfying Case 2 of Theorem 3.1.
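The two-case moment-matching construction above can be sketched as a small routine that, given a target mean, target variance and $C$, returns the parameters of the sum together with the mean and variance it attains. This is a minimal illustration under simplifying assumptions: it does not check that the number of components is at most $n$ or that the shifts fit inside $\{0, \ldots, k-1\}$ (conditions the text derives from Observations D.6 to D.8), it assumes $q < 1$ in the second case, and the name `match_moments` and its return format are hypothetical.

```python
from fractions import Fraction
from math import floor

def match_moments(mu, var, C):
    """Return (description, mean, variance) of a sum of independent IRVs,
    each constant modulo C's divisor c, approximating the targets (mu, var)."""
    mu, var = Fraction(mu), Fraction(var)
    if mu >= 2 * var / C:
        # Case 1: n_coins IRVs equal to x_i or x_i + C with probability 1/2 each.
        # The x_i's and the constant components absorb the rest of the mean,
        # modelled here as a single integer total shift.
        n_coins = floor(4 * var / C ** 2)
        shift = round(mu - n_coins * Fraction(C, 2))
        mean = shift + n_coins * Fraction(C, 2)
        variance = n_coins * Fraction(C * C, 4)
        return ("fair", n_coins, shift), mean, variance
    # Case 2: a IRVs that are 0 with probability q and C with probability 1 - q.
    q = var / (C * mu)                  # assumes q < 1, i.e. var < C * mu
    a = round(var / (q * (1 - q) * C ** 2))
    mean = a * (1 - q) * C
    variance = a * q * (1 - q) * C ** 2
    return ("biased", a, q), mean, variance

biased, mean2, var2 = match_moments(mu=30, var=90, C=4)   # 30 < 2*90/4: case 2
fair, mean1, var1 = match_moments(mu=100, var=90, C=4)    # 100 >= 45: case 1
```

In the second-case example the targets are met exactly with 30 components and $q = 3/4$. In the first-case example the mean is within 1 of the target and the variance within $C^2/4$, the variance of a single fair component.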